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Abstract 

We  consider  the  problem  of  efficiently  estimating  gradients  from  stochastic  simulation.  Although  the 
primary  motivation  is  their  use  in  simulation  optimization,  the  resulting  estimators  can  also  be  useful 
in  other  ways,  e.g.,  sensitivity  analysis.  The  main  approaches  described  are  finite  differences  (including 
simultaneous  perturbations),  perturbation  analysis,  the  likelihood  ratio/score  function  method,  and  the 
use  of  weak  derivatives. 


1  Introduction 

For  optimization  problems  with  continuous-valued  controllable  parameters,  the  availability  of  gradients  is 
clearly  helpful  in  obtaining  improved  solutions  based  on  an  iterative  scheme,  and  can  play  a  critical  role  in 
making  a  particular  problem  tractable.  This  is  true  in  stochastic  optimization,  as  well,  especially  for  such 
problems  based  on  an  underlying  simulation  model.  However,  in  this  stochastic  setting,  since  the  outputs  are 
themselves  random,  finding  or  deriving  stochastic  gradient  estimators  can  itself  be  a  challenging  problem. 

We  write  the  general  simulation  optimization  problem  as  follows: 

minJ(0),  (1) 

ee  e 

where  9  £  0  is  the  controllable  parameter  and  0  C  7Zd  is  the  feasible  region,  i.e.,  9  is  a  d-dimensional  vector. 

We  describe  three  examples  that  will  be  for  illustrative  purposes.  The  first  example  is  a  stochastic  activity 
network.  The  input  random  variables  are  the  individual  activity  times  X\,  X%, . . . ,  Xd,  and  the  objective 
function  is  the  total  project  duration.  The  parameters  are  the  mean  activity  times,  i.e.,  9  =  (9\, . . .  ,0d), 
where  9i  is  the  mean  of  the  itli  activity,  and  the  dimension  of  the  parameter  vector  is  equal  to  the  number 
of  activities.  The  optimization  problem  is  to  minimize  the  expected  project  duration  as  a  function  of  the 
mean  activity  times,  where  a  cost  is  attached  to  each  choice  of  mean  activity  time. 

The  second  example  is  the  familiar  first-come,  first-served  (FCFS),  single-server  queue,  with  the  input 
random  variables  being  the  interarrival  and  service  times,  and  the  output  performance  measure  being  time 
spent  in  the  system.  The  objective  function  includes  a  cost  on  the  service  rate,  and  the  optimization  problem 
is 

min  E\T(9)\  +  c/0,  (2) 

fe(o,i/A) 

where  c  is  the  service  rate  cost,  T(9)  is  the  average  system  time,  9  is  the  mean  service  time  (hence  9  is 
scalar),  and  A  is  the  arrival  rate.  Usually  the  problem  is  considered  in  steady  state,  and  often  specializing  to 
the  M/M/1  queue,  because  this  leads  to  an  analytically  tractable  solution  that  can  be  compared  with  that 
obtained  using  the  simulation  optimization  algorithm. 

The  third  example  is  an  (s,  S)  inventory  control  system.  Here,  the  input  random  variables  are  related 
to  demand  (amount  and  possibly  timing),  with  the  objective  function  being  a  total  cost  function  associated 
with  inventory  levels  and  order  amounts,  and  the  optimization  problem  being  to  minimize  expected  total 
cost  by  the  selection  of  the  inventory  control  parameters  s  and  S,  i.e.,  9  =  (s,  S)  is  a  two-dimensional  vector. 

Although  the  main  application  of  gradient  estimation  emphasized  in  this  paper  is  simulation-based  op¬ 
timization,  derivative  estimation  has  other  important  applications  in  simulation,  most  notably  sensitivity 
analysis.  This  can  be  useful  in  many  different  contexts,  e.g.,  factor  screening  to  decide  which  factors  are 
the  most  critical,  and  hedging  of  financial  instruments  and  portfolios.  The  rest  of  this  paper  is  organized  as 
follows.  A  brief  overview  of  gradient-based  simulation  optimization  is  provided.  Then  the  main  approaches 
for  stochastic  gradient  estimation  are  developed  in  some  detail,  including  examples,  some  discussion  of  desir¬ 
able  theoretical  properties  and  computational  requirements.  Specific  applications  in  different  areas  are  then 

1  Tli  is  is  a  pre-print  version  of  Chapter  19  in  Handbooks  in  Operations  Research  and  Management  Science:  Simulation, 
S.G.  Henderson  and  B.L.  Nelson,  eds.,  Elsevier. 
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described.  Other  than  a  few  exceptions  —  such  as  acknowledging  a  specific  key  result  —  references  to  the 
literature  will  be  provided  in  (deferred  to)  Sections  8  and  9  on  applications  and  probing  further. 

We  conclude  this  introduction  by  explicitly  stating  an  implicit  assumption  that  pervades  stochastic 
gradient  estimation  research  (as  well  as  much  of  the  research  reported  in  this  handbook). 

Key  Implicit  Assumption:  Each  estimate  of  J  at  a  given  9  is  expensive  to  generate. 

This  is  not  a  very  rigorous  statement,  as  it  is  not  even  mathematical,  but  the  gist  of  its  implication  is 
that  because  estimating  J  requires  a  nontrivial  amount  of  effort,  it  behooves  us  to  make  use  of  its  output 
more  efficiently.  The  costliness  can  be  due  to  a  number  of  reasons: 

(a)  It  is  expensive  to  generate  input  random  variables  {X;}  used  to  produce  an  estimate  of  J. 

(b)  A  lot  of  input  random  variables  need  to  be  generated,  either  because  each  estimate  involves  a  large 
number  of  input  random  variables,  or  because  a  lot  of  estimates  (simulation  replications)  need  to  be 
generated  to  achieve  a  desired  level  of  precision. 

(c)  It  is  a  nontrivial  task  to  go  from  the  input  random  variables  {W}  to  the  estimate  of  J . 

In  most  stochastic  discrete-event  simulation  models  of  practical  interest,  both  (b)  and  (c)  are  true.  If  none 
of  these  conditions  hold,  then  the  simulation  user  should  probably  just  use  “brute  force”  finite  difference 
techniques,  which  are  described  in  Section  3. 


2  Gradient-Based  Simulation  Optimization 

The  two  main  approaches  for  conducting  simulation-based  optimization  can  be  roughly  characterized  as 
follows: 

-  Carry  out  all  of  the  simulations  first  —  generally  a  very  large  number  of  replications  —  and  then  store 
things  appropriately  (random  number  seeds,  the  random  numbers  themselves,  or  realizations  of  random 
variates  or  possibly  sample  paths),  converting  the  stochastic  problem  into  a  deterministic  problem  that 
is  based  on  a  large  enough  set  of  samples  to  well  approximate  the  desired  problem. 

-  Carry  out  a  relatively  small  set  of  simulations  and  iteratively  improve  upon  the  current  solution  (or 
set  of  solutions  in  a  population-based  approach)  until  a  sufficiently  good  solution  is  reached  or  found. 

Previously,  the  first  approach  might  have  been  hindered  for  large  problems  by  constraints  in  computer 
memory,  but  that  has  become  less  of  an  issue  over  the  past  decade,  making  it  a  more  attractive  option,  due 
to  its  conceptual  simplicity  and  ability  to  use  the  arsenal  of  deterministic  nonlinear  optimization  algorithms 
available.  Since  our  main  concern  is  not  deterministic  optimization,  we  will  not  delve  further  along  this  line; 
see  the  last  section  for  some  references  to  the  development  of  the  theory  and  application  of  this  approach, 
which  has  a  number  of  different  names:  sample  average  approximation,  sample  path  optimization,  and 
stochastic  counterpart.  We  note  that  the  gradient  estimates  discussed  here  can  also  be  incorporated  into, 
and  often  play  a  critical  role  in,  these  algorithms.  Often  this  involves  “freezing”  the  random  numbers,  to  be 
able  to  use  them  to  generate  a  value  of  the  performance  measure  at  any  value  of  the  parameter,  and  this 
value  is  treated  as  a  deterministic  quantity  rather  than  a  sample  estimate. 

The  second  approach  uses  the  stochastic  analog  of  gradient-based  optimization,  which  is  converted  to 
a  zero-finding  problem  (for  the  gradient  of  the  objective  function)  and  then  addressed  using  stochastic 
approximation,  which  we  now  describe. 

2.1  Stochastic  Approximation 

A  natural  adaptation  of  “steepest  descent”  in  deterministic  nonlinear  optimization  is  stochastic  approxima¬ 
tion  (SA),  an  iterative  update  scheme  on  the  parameter  that  takes  the  following  general  form  for  finding  a 
zero  of  the  objective  function  gradient: 


0n+i 


,V J(9r 


(3) 
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where  the  “hat”  notation  denotes  an  estimate  of  the  gradient  VJ(0„),  {an}  denotes  the  “gain”  (step-size 
multiplier)  sequence,  and  lie  denotes  a  projection  back  into  the  feasible  region  0  when  the  update  (3)  takes 
6  out  of  0.  Whereas  in  second-order  Newton-Raphson  schemes,  a  key  enhancer  is  to  use  the  inverse  Hessian 
to  estimate  the  optimal  step  size,  this  is  much  less  of  a  concern  in  SA,  especially  in  the  early  stages  of  the 
algorithm.  In  fact,  to  guarantee  almost  sure  (a.s.)  convergence,  the  gain  sequence  must  vanish  in  the  limit, 
but  not  too  quickly.  The  usual  condition  is  that 

^an  =  oo,  a2  <  oo. 

n  n 

However,  in  practice,  one  often  decreases  the  step  size  to  some  value  at  which  it  is  kept  constant.  The¬ 
oretically,  this  leads  to  weak  convergence  (in  distribution),  at  best,  which  might  be  unsatisfactory.  Note 
that  the  sequence  need  not  be  deterministic  either,  in  which  case  the  conditions  above  have  to  be  modified 
accordingly.  There  are  also  Central  Limit  Theorem  results  for  the  asymptotic  behavior  of  9n ,  but  these  are 
beyond  the  scope  of  this  work. 

When  V  J(9n)  is  an  unbiased  estimator  of  V  J(0„),  the  SA  algorithm  is  generally  referred  to  as  being  of 
the  Robbins-Monro  type,  whereas  if  VJ(0„)  is  only  asymptotically  unbiased,  e.g.,  using  a  finite  difference 
estimate  with  the  difference  going  to  zero  at  an  appropriate  rate,  then  the  algorithm  is  referred  to  being  of  the 
Kiefer- Wolfowitz  type.  The  Robbins-Monro  SA  algorithm  generally  has  a  canonical  asymptotic  convergence 
rate  of  n-1/2,  in  contrast  to  n-1/3  for  the  Kiefer- Wolfowitz  SA  algorithm,  which  takes  the  following  form  in 
the  scalar  parameter  case: 


9n+ 1  • —  ne 


J{9n~^~Cn)  J{@n  C-n) 
2  Cri 


(4) 


where  J  denotes  an  estimate  of  J  and  {cn}  denotes  a  difference  sequence  that  must  also  decrease  to  zero  at 
an  appropriate  rate  satisfying 


E 


^  OO} 


E  an/C^ 


<  00. 


Thus,  in  addition  to  having  a  slower  canonical  asymptotic  convergence  rate,  a  Kiefer- Wolfowitz  SA  algorithm 
involves  the  additional  selection  of  an  appropriate  difference  sequence.  In  certain  special  cases  involving 
common  random  numbers,  however,  the  best  n-1/2  rate  can  also  be  achieved  in  practice.  A  common  form 
for  the  two  sequences  is  an  =  an~l  for  some  positive  a  (harmonic  series)  and  cn  =  cn for  positive  c.  If 


J(9n  +  cn)  —  J(9n) 


(5) 


is  used  instead  for  the  gradient  estimator  in  the  Kiefer- Wolfowitz  SA  algorithm  (i.e.,  one-sided  forward 
difference  gradient  estimator),  then  cn  =  cn-1/4  is  commonly  used. 

Perhaps  one  of  the  key  drawbacks  of  using  an  SA  algorithm,  especially  for  the  new  or  inexperienced  user, 
is  the  sensitivity  of  the  early  “transient”  convergence  rate  to  the  choices  of  these  sequences.  For  example, 
if  the  common  sequences  just  mentioned  are  used,  then  the  behavior  will  depend  on  the  choices  of  a  and  c. 
If  a  is  too  small,  then  the  algorithm  will  “crawl”  towards  the  optimum,  even  at  the  1 / \fn.  asymptotic  rate. 
On  the  other  hand,  if  a  is  chosen  too  large,  then  extreme  oscillations  may  occur,  resulting  in  an  “unstable” 
progression.  As  far  as  we  know,  all  of  the  theoretical  results  relating  to  convergence  rate  of  these  types  of 
algorithms  are  asymptotic  results,  which  may  be  of  little  practical  use  in  some  applications  where  all  of  the 
action  occurs  in  a  relatively  short  time. 

Another  drawback  is  that  in  general  the  SA  algorithm  converges  only  to  a  local  optimum,  although  this 
can  also  be  strengthened  by  the  appropriate  introduction  of  “noise”  into  the  algorithm.  Finally,  in  some 
settings,  it  may  not  be  obvious  how  to  project  onto  the  feasible  region  0,  which  might  be  specified  indirectly 
(e.g.,  in  a  mathematical  programming  formulation)  and  possibly  involve  “noisy”  constraints  that  also  have  to 
be  estimated  along  with  the  objective  function.  These  are  practical  issues  that  still  have  not  been  adequately 
addressed  for  simulation  optimization. 
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3  Indirect  Gradient  Estimation 


We  divide  the  approaches  to  stochastic  gradient  estimation  into  two  main  categories  —  indirect  and  direct 
—  which  we  now  define.  An  indirect  gradient  estimator  usually  has  two  characteristics:  (i)  it  only  estimates 
an  approximation  of  the  true  gradient  value,  e.g.,  via  a  secant  approximation  in  the  scalar  case;  and  (ii) 
it  uses  only  function  evaluations  (performance  measure  output  samples)  from  the  original  (unmodified) 
system  of  interest.  A  direct  gradient  estimator  tries  to  estimate  the  true  gradient  using  some  additional 
analysis  of  the  underlying  stochastics  of  the  model.  More  specifically,  we  will  refer  to  the  indirect  gradient 
estimation  approach  as  one  in  which  the  simulation  output  is  treated  as  coming  out  of  a  given  black  box,  by 
which  we  mean  it  satisfies  two  assumptions:  (i)  no  knowledge  of  the  underlying  mechanics  of  the  simulation 
model  is  used  in  deriving  the  estimators,  such  as  knowing  the  input  probability  distributions;  and  (ii)  no 
changes  are  made  in  the  execution  of  the  simulation  model  itself,  such  as  changing  the  input  distribution 
for  importance  sampling.  Note  that  this  entails  satisfying  both  assumptions;  many  of  the  direct  gradient 
estimation  techniques  can  be  implemented,  without  changing  anything  in  the  underlying  simulation,  but  they 
may  require  some  knowledge  of  the  simulation  model,  such  as  the  input  distributions  or  some  of  the  system 
dynamics.  In  the  case  of  stochastic  simulation,  as  opposed  to  online  estimation  based  on  an  actual  system, 
it  could  be  argued  that  to  carry  out  the  simulation  most  of  these  mechanics  need  to  be  known,  i.e. ,  one 
cannot  carry  out  a  stochastic  simulation  without  specifying  the  input  distributions.  Here,  we  simply  use 
the  two  assumptions  to  distinguish  between  the  two  categories  of  approaches  and  not  to  debate  whether  an 
estimator  is  “model”  dependent  or  not.  In  terms  of  stochastic  approximation  algorithms,  indirect  and  direct 
gradient  estimators  generally  correspond  to  Kiefer- Wolfowitz  and  Robbins-Monro  algorithms,  respectively. 

We  describe  two  indirect  gradient  estimators:  finite  differences  and  simultaneous  perturbations.  Following 
our  definition,  these  approaches  require  no  knowledge  of  the  workings  of  the  simulation  model,  which  is 
treated  as  a  black  box. 

3.1  Finite  Differences 

The  “brute  force”  or  “naive”  method  for  estimating  a  gradient  at  6  is  simply  to  use  finite  differences,  i.e., 
perturbing  the  value  of  each  component  of  0  separately  while  holding  the  other  components  at  the  nominal 
value.  If  the  value  of  the  perturbation  is  too  small,  the  resulting  difference  estimator  could  be  extremely 
noisy,  because  the  output  is  stochastic;  hence  there  is  a  trade-off  between  bias  and  variance  in  making 
this  selection,  and  unless  all  components  of  the  parameter  vector  are  suitably  “standardized”  a  priori,  this 
choice  must  be  done  for  each  component  separately,  which  could  be  a  burdensome  task  for  high-dimensional 
problems.  If  the  goal  is  sensitivity  analysis  rather  than  optimization,  then  one  would  generally  err  on  the 
side  of  selecting  a  relatively  larger  value  for  the  perturbation. 

The  it li  component  of  the  one-sided  forward  difference  gradient  estimator  is  given  by 

J(6  +  Cjej)  -J{6) 

Ci 

where  c  is  the  vector  of  differences  (cj  the  perturbation  in  the  ith  direction)  and  e,  denotes  the  unit  vector 
in  the  ith  direction. 

The  'ttli  component  of  the  two-sided  symmetric  (or  central)  difference  gradient  estimator  is  given  by 

J  {9  T  CiCi)  ,](()  CiCi')  .  . 

2 7i  ’  1  ’ 

which  corresponds  to  the  estimator  used  in  the  original  Kiefer- Wolfowitz  SA  algorithm  given  by  (4). 

Again,  it  should  be  noted  that  in  stochastic  simulation,  using  common  random  numbers  can  reduce  the 
variance  of  the  gradient  estimators  substantially,  although  in  practice  synchronization  is  clearly  an  issue,  since 
merely  using  the  same  random  number  seeds  is  typically  not  effective.  The  symmetric  difference  estimator 
given  by  (7)  is  more  accurate,  but  it  requires  2d  objective  function  estimates  (simulation  replications)  per 
gradient  estimate,  as  opposed  to  d+ 1  function  estimates  (simulation  replications)  for  the  one-sided  estimator 
given  by  (6).  For  example,  in  the  stochastic  activity  network,  the  symmetric  difference  estimator  would 
involve  performing  simulations  by  varying  the  mean  of  each  activity  i  individually  by  ±Cj  while  holding  the 
other  activity  means  at  their  nominal  values,  i.e.,  at  parameter  settings  {0;  ±  q,  9j,j  ^  i},  for  i  =  1, . . . ,  d. 
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3.2  Simultaneous  Perturbations 


This  approach  has  the  advantage  that  the  number  of  simulation  replications  needed  to  form  an  estimator  of 
the  gradient  is  independent  of  the  dimension  of  the  parameter  vector. 

The  itli  component  of  the  simultaneous  perturbations  (SP)  gradient  estimator  is  given  by 


J{6  +  cA)  —  J(6  —  cA) 

2cjAj 


(8) 


where  A  =  (Ai, . . . ,  Ad)  is  a  d-dimensional  vector  of  perturbations,  which  are  generally  assumed  i.i.d.  as  a 
function  of  iteration  and  independent  across  components.  In  this  case,  c  is  again  the  set  of  differences  for 
each  component,  but  it  is  a  diagonal  matrix  with  the  differences  {c.;}  on  the  diagonal.  The  key  difference 
between  this  estimator  and  a  finite  difference  estimator  is  that  the  numerator  of  (8)  —  corresponding  to  a 
difference  in  the  function  estimates  —  is  the  same  for  all  components  (i.e.,  independent  of  *),  whereas  the 
numerator  in  the  symmetric  difference  estimator  given  by  (7)  involves  a  different  pair  of  function  estimates 
for  each  component  (i.e.,  is  a  function  of  i).  Thus,  the  full  gradient  estimator  requires  only  two  function 
estimates,  regardless  of  the  size  of  the  parameter  dimension  d.  However,  since  d  random  numbers  must  be 
generated  to  produce  the  perturbation  sequence  A  at  each  iteration,  if  generating  the  function  estimates  J 
is  relatively  “cheap,”  then  this  procedure  is  likely  to  be  inferior  to  the  previous  finite  difference  approaches. 
However,  our  key  implicit  assumption  is  that  function  estimates  are  expensive,  which  is  generally  true  in 
the  context  of  stochastic  discrete-event  simulation.  Furthermore,  this  procedure  may  also  be  of  benefit  in 
deterministic  situations  where  J  is  expensive  to  generate,  e.g.,  requires  a  computationally  intensive  finite 
element  analysis  program. 

The  key  requirement  on  the  perturbation  sequence  to  guarantee  a.s.  convergence  when  used  in  a  simulta¬ 
neous  perturbation  stochastic  approximation  (SPSA)  algorithm  is  that  each  term  have  mean  zero  and  finite 
inverse  second  moments.  Thus,  the  normal  (Gaussian)  distribution  is  prohibited,  and  the  most  commonly 
used  distribution  is  the  symmetric  Bernoulli,  whereby  the  perturbation  takes  the  positive  and  negative  (equal 
in  magnitude,  e.g.,  ±1)  value  w.p.  0.5.  Intuitively,  convergence  comes  about  from  the  averaging  property 
of  the  random  directions  selected  at  each  iteration,  i.e.,  in  the  long-run,  each  component  will  converge  to 
the  correct  gradient  even  if  at  any  particular  iteration  the  estimator  may  appear  odd.  This  estimator  was 
designed  for  optimization  via  stochastic  approximation,  and  is  of  limited  use  in  sensitivity  analysis,  although 
perhaps  averaging  over  a  large  number  of  replications  might  make  it  more  applicable. 

A  very  similar  gradient  estimator  for  use  in  SA  algorithms  is  the  random  directions  gradient  estimator, 
whose  ith  component  is  given  by 

(j(0  +  cA)  —  J(9  —  cA))  Aj 

A _ L _  (a\ 


Instead  of  dividing  by  the  perturbation  component,  the  difference  term  multiplies  the  component.  Thus, 
normal  distributions  can  be  used  for  the  perturbation  sequence,  and  convergence  requirements  translate  the 
moment  condition  to  a  bound  on  the  second  moment,  as  well  as  zero  mean.  Of  course,  a  correspondence  to 
the  SP  estimator  can  be  made  by  simply  taking  the  componentwise  inverse,  but  in  practice  the  performance 
of  the  two  resulting  SA  algorithms  differs  substantially. 

A  more  recent  development  in  the  application  of  SPSA  is  to  use  deterministic  sequences  for  the  per¬ 
turbation  sequences  {A}.  The  idea  is  analogous  to  the  use  of  quasi-Monte  Carlo,  whereby  the  gradient 
averaging  is  the  critical  factor.  There  are  also  relevant  connections  to  literature  in  the  design  of  experiments 
methodology.  However,  more  theoretical  work  explaining  their  effectiveness  and  more  numerical  examples 
establishing  their  advantage  over  stochastic  sequences  is  needed. 


4  Direct  Gradient  Estimation 

Although  indirect  gradient  estimation  offers  greater  generality,  direct  gradient  estimation  has  the  following 
advantages: 

•  It  usually  provides  an  unbiased  estimator,  which  leads  to  faster  convergence  rates  when  implemented 
in  a  simulation  optimization  algorithm,  e.g.,  stochastic  approximation. 
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•  It  eliminates  the  need  to  determine  appropriate  values  for  the  finite  difference  perturbations  —  c  in 
the  estimators  given  by  (4),  (5),  (6),  (7),  (8),  (9),  —  which  influences  the  accuracy  of  the  estimator. 
Smaller  values  lead  to  lower  bias  but  generally  at  the  cost  of  increased  variance,  to  the  point  of  possibly 
giving  the  wrong  sign  for  small  enough  values. 

•  The  resulting  estimators  are  usually  more  computationally  efficient.  This  is  almost  universally  true 
when  compared  to  high-dimensional  brute  force  finite  differences,  but  not  necessarily  the  case  when 
compared  with  SP  estimators.  When  used  in  stochastic  approximation,  this  can  result  in  faster  con¬ 
vergence  rates,  as  discussed  earlier. 

Potential  challenges  in  applying  direct  gradient  estimation  include  the  following: 

•  They  require  more  “off-line”  work,  which  might  be  as  simple  as  computing  some  derivatives  of  density 
functions,  but  could  involve  quite  a  bit  of  problem-specific  analysis  requiring  sophisticated  applied 
probability  tools. 

•  The  implementation  usually  requires  some  coding  inside  the  simulation  model,  and  sometimes  involves 
some  changes  in  the  way  the  simulation  is  actually  carried  out. 

For  expositional  purposes,  we  will  assume  that  the  objective  function  is  an  expectation,  specifically 

J{9)  =  E[Y(9)\  =  E[Y(XU  ...,XT)},  (10) 

where  Y  is  the  (univariate)  output  performance  measure,  {Xi}  are  the  input  random  variables,  and  T  is  a 
fixed  finite  number.  This  covers  many  types  of  performance  measures,  including  distributional  performance 
measures  such  as  the  tail  distribution,  handled  using  indicator  functions.  However,  the  “dual”  performance 
measure  involving  quantiles  (e.g.,  median),  and  also  measures  such  as  the  mode,  cannot  be  absorbed  into 
this  framework.  The  cases  where  T  is  random  (a  stopping  time)  or  as  T  goes  to  infinity  (steady  state),  can 
also  be  handled,  but  require  some  additional  technicalities. 

In  the  right-most  expression  for  the  performance  measure  given  in  (10),  the  dependence  on  9  has  not  been 
displayed,  because  where  9  appears  influences  which  direct  gradient  estimation  technique  is  most  applicable. 
In  particular,  we  distinguish  between  two  main  dependencies: 

•  sample  (pathwise); 

•  measure  (distributional). 

It  is  important  to  note  that  for  many  performance  measures  of  interest,  Y  can  be  expressed  such  that  either 
dependency  can  be  used.  We  will  see  examples  of  this  in  the  discussion  that  follows. 

To  derive  direct  gradient  estimators,  we  write  the  expectation  using  what  is  sometimes  called  the  law  of 
the  unconscious  statistician: 

£[T(X)]  =  J  ydFY(y)  =  J  Y(x)dFx(x).  (11) 

In  fact,  one  of  our  chief  views  of  stochastic  simulation  is  a  way  of  carrying  out  this  relationship,  i.e. ,  coming 
into  the  simulation  are  input  random  variables,  for  which  we  know  the  distribution;  coming  out  of  the 
simulation  are  output  random  variables,  for  which  we  would  like  to  know  the  distributions.  However,  what 
we  have  is  a  way  to  generate  samples  of  the  output  random  variables  as  a  function  of  the  input  random 
variables  via  the  simulation  model.  Of  course,  if  we  really  knew  the  distribution  of  the  sample  performance 
output  r.v.  Y,  then  there  would  probably  be  little  reason  to  have  to  simulate  the  system. 

For  simplicity  in  discussion,  we  will  assume  henceforth  that  the  parameter  9  is  scalar,  because  the  vector 
case  can  be  handled  by  taking  each  component  individually.  In  view  of  the  right-hand  side  of  (11),  we 
revisit  the  question  as  to  the  location  of  the  parameter  in  a  stochastic  setting.  Putting  it  in  the  sample 
performance  Y(-'9)  corresponds  to  the  view  of  perturbation  analysis  (PA),  whereas  if  it  is  absorbed  in  the 
distribution  F(-;  9),  then  the  approach  follows  that  of  the  likelihood  ratio  (LR)  method  (also  known  as  the 
score  function  (SF)  method)  or  weak  derivatives  (WD)  (also  known  as  measure- valued  differentiation).  In 
the  general  setting  where  the  parameter  is  a  vector,  it  is  possible  that  some  of  the  components  would  be 
most  naturally  located  in  the  sample  performance,  while  others  would  be  easily  retained  in  the  distributions, 
giving  rise  to  a  mixed  approach.  For  example,  in  the  (s,  S)  inventory  control  case  with  random  order  arrival 
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times,  it  might  be  most  effective  to  use  PA  for  the  control  parameters,  WD  for  demand  parameters,  and 
LR/SF  for  order  parameters. 

Naming  the  parameter  “Spot”  for  good  measure,  we  pose  a  question  and  offer  several  observations: 

(i)  Where  is  Spot? 

This  may  determine  which  approach  is  most  appropriate.  LR/SF  and  WD  estimators  consider  distri¬ 
butional  parameters,  so  for  them  the  parameter  should  appear  in  the  input  distribution(s). 

(ii)  Spot  can  move! 

For  the  same  system,  the  problem  may  be  formulated  such  that  Spot  appears  in  either  spot.  We  will 
demonstrate  this  in  the  examples.  In  the  first  two  examples,  it  is  simply  a  matter  of  interpretation  of 
the  underlying  stochastics  (probability  measures),  whereas  in  the  (s,  S)  inventory  system,  a  change  of 
variables  is  necessary. 

(iii)  Spot  can  make  repeat  appearances. 

For  example,  in  the  single-server  queue  example,  where  the  parameter  appears  in  the  service  time 
distribution,  the  parameter  must  be  considered  every  time  a  service  time  is  generated. 

(iv)  Spot  can  be  in  two  places  at  once! 

It  is  possible  that  the  parameter  appears  in  both  the  distribution  and  in  a  sample  pathwise  manner. 
Also,  the  parameter  could  appear  in  more  than  one  distribution  (as  opposed  to  being  in  the  same 
distribution  multiple  times,  as  in  the  previous  item). 

Let  /  denote  the  joint  p.cl.f.  of  all  of  the  input  random  variables.  Differentiating  (11),  and  assuming  an 
interchange  of  integration  and  differentiation  is  permissible,  we  write  two  cases: 


dE[Y{X)\ 

de 


Y  (x) 


df{x;  9) 
d9 


dY(X(6 ;  u)) 

de 


-dx 

(12) 

du , 

(13) 

where  x  and  u  and  the  integrals  are  all  T-dimensional.  For  notational  simplicity,  throughout  these  T- 
dimensional  multiple  integrals  are  written  as  a  single  integral,  and  we  also  assume  one  random  number  u 
produces  one  random  variate  x.  In  (12),  the  parameter  appears  in  the  distribution  directly,  whereas  in  (13), 
the  underlying  uncertainty  is  considered  the  uniform  random  numbers.  These  correspond  to  what  we  called 
the  distributional  (measure)  and  pathwise  (sample)  dependencies,  respectively. 

For  expositional  ease  in  introducing  the  approaches,  we  begin  by  assuming  that  the  parameter  only 
appears  in  X±,  which  is  generated  independently  of  the  other  input  random  variables.  So  for  the  case  of  (13), 
we  use  the  chain  rule  to  write 


dE[Y{X)\ 

de 


dYjx^e-  Ul),x2,...)) 

de 


du 


r1  dY  dx^e-m), 

L  m— —dv 


In  other  words,  the  estimator  takes  the  form 


(14) 


dY(X)  dX i 
dX1  df ’ 


(15) 


where  the  parameter  appears  in  the  transformation  from  random  number  to  random  variate,  and  the  deriva¬ 
tive  is  expressed  as  the  product  of  a  sample  path  derivative  and  derivative  of  a  random  variable.  The  issue 
of  what  constitutes  the  latter  will  be  taken  up  shortly,  but  this  approach  is  called  infinitesimal  perturbation 
analysis  (IPA).  For  the  M/M/1  queue,  the  sample  path  derivative  could  be  derived  using  Lindley’s  equation, 
relating  the  time  in  system  of  a  customer  to  the  service  times  (and  interarrival  times,  which  are  not  a  function 
of  the  parameter). 
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Assume  that  X\  has  marginal  p.cl.f.  fi(-;8)  and  that  the  joint  p.d.f.  for  the  remaining  input  random 
variables  (X2,  ■  ■  ■)  is  given  by  /_  1,  which  has  no  (functional)  dependence  on  8.  Then  the  assumed  inde¬ 
pendence  gives  /  =  fif—i,  and  the  expression  (12)  involving  differentiation  of  a  density  (measure)  can  be 
further  manipulated  using  the  product  rule  of  differentiation  to  yield  the  following: 


dE[Y{X)] 

d.0 


Y(x) 
Y(  x) 


dfi{xi;6) 

88 

5 In  fi(xi\  8) 

88 


f- l(X2,  ■  ■  ■ )dx 
f{x)dx. 


In  other  words,  the  estimator  takes  the  form 

^dln  f^XuO) 

y  W - - 


(16) 

(17) 


(18) 


Since  the  term  is  the  well-known  (efficient)  score  function  in  statistics,  this  approach  has  been 

called  the  score  function  (SF)  method.  The  other  name  given  to  this  approach  —  the  likelihood  ratio  (LR) 
method  —  comes  from  the  closely  related  likelihood  ratio  function  given  by 


AM) 
AG;  0o )  ’ 


which  when  differentiated  with  respect  to  8  gives 


dfi{-,8)/88 

AG;0q) 


which  is  equal  to  the  score  function  upon  setting  8q  =  8. 

On  the  other  hand,  if  instead  of  expressing  the  right-hand  side  of  (16)  as  (17),  the  density  derivative  is 
written  as 

m^’e)  =  c(8)  i;0)  -  /i(1)Ori;0))  , 

it  leads  to  the  following  relationship: 


dE[Y(X)} 

d8 


a  00  /»oo 

Y(x)f[2\x1;8)f-i(x2,.-.)dx-  /  Y(x)f[1)  (xr,  8)f-i(x2,  •  •  -)da 

-OO  J  —  OO 


The  triple  (c(8),  constitutes  a  weak  derivative  (WD)  for  /j,  which  is  in  general  not  unique,  as  we 

shall  shortly  see.  The  corresponding  WD  estimator  is  of  the  form 


c(8)  (Y(X[2\X2, ...)-  Y(X[1\X2, ...)), 


(19) 


where  X j1"*  ~  /j 1 1  and  X[2^  ~  /j2\  The  term  “weak”  derivative  comes  about  from  the  possibility  that 
(16)  may  not  be  proper,  but  its  integral  may  be  well-defined,  e.g.,  it  might  involve  delta-functions 
(impulses),  corresponding  to  mass  functions  of  discrete  distributions. 

If  in  the  expression  (13),  the  interchange  of  expectation  and  differentiation  does  not  hold  (e.g.,  if  Y  is  an 
indicator  function),  then  as  long  as  there  is  more  than  one  input  random  variable,  appropriate  conditioning 
will  often  allow  the  interchange  as  follows: 


dE[Y(X)]  f1  dE[Y(X(9;  u))\Z(u)] 

d8  JQ  d8 


(20) 


where  Z  C  {X\ , . . . ,  Xt}-  This  conditioning  is  known  as  smoothed  perturbation  analysis  (SPA),  because  it 
is  intended  to  “smooth”  out  a  discontinuous  function.  It  leads  to  an  estimator  of  the  following  form: 


8E[Y{X)\Z]  dX  1 
dXl  df’ 


(21) 


Note  that  taking  Z  as  the  entire  set  leads  back  to  (15). 

Remark  For  SPA,  the  conditioning  in  (20)  was  done  with  respect  to  a  subset  of  the  input  random  variables 
only.  Further  conditioning  can  be  done  on  events  in  the  system,  which  leads  to  an  estimator  of  the  following 
general  form: 

5 +  (22) 

where  the  subscript  indicates  a  corresponding  conditional  expectation/probability,  B  is  an  appropriately 
selected  event,  and  SY  represents  a  change  in  the  performance  measure  under  the  conditioned  (usually 
called  “critical”)  event.  In  this  case,  if  the  probability  rate  dP^ ^  is  0,  the  estimator  (22)  also  reduces  to 
IPA.  On  the  other  hand,  if  the  IPA  term  is  zero,  the  estimator  may  coincide  with  the  WD  estimator  in 
certain  cases,  with  correspondences  between  c(9)  and  the  probability  rate,  and  between  the  difference  term 
in  (19)  and  the  conditional  expectation  in  (22). 

4.1  Desirable  Properties 

Direct  gradient  estimators  are  estimators  like  any  other  estimators;  they  just  happen  to  be  estimating 
derivatives.  Thus,  like  other  estimators,  we  would  like  them  to  be  unbiased. 

Definition.  The  stochastic  gradient  estimator  VJ(0)  is  unbiased  if  E[XJ(9)]  =  X7gJ(9). 

We  would  also  like  the  estimators  to  have  low  variance.  In  other  words,  we  want  them  to  be  both  correct 
and  precise,  as  opposed  to  being  wrong  and  noisy.  Strong  consistency  is  another  desirable  property.  For 
finite  horizon  performance  measures,  we  simply  invoke  the  strong  law  of  large  numbers.  For  infinite  horizon 
or  steady-state  performance  measures,  the  usual  ergodicity  considerations  come  into  play,  and  just  as  in 
steady-state  output  analysis  methodology,  this  can  involve  a  lot  of  theoretical  tools  from  applied  probability. 

Basically,  unbiasedness  requires  the  exchange  of  the  operations  of  differentiation  (limit)  and  integration 
(expectation),  as  was  assumed  in  deriving  (12)  and  (13).  Mathematically,  this  boils  down  to  whether  or  not 
the  dominated  convergence  theorem  can  be  applied  (see  Section  6).  In  the  case  of  PA,  the  bounding  involves 
properties  of  the  performance  measure,  whereas  in  LR/SF  and  WD,  the  bounding  involves  the  distribution 
functions.  In  either  case,  the  primary  difficulty  is  in  being  able  to  implement  the  derivative/gradient,  just 
as  building  a  simulation  model  is  a  non-trivial  task  in  implementation,  although  conceptually  there  may  be 
little  difficulty. 

The  applicability  of  IPA  may  depend  on  how  the  input  processes  are  constructed/generated  (see  the  next 
section).  In  applying  SPA,  there  is  the  choice  of  conditioning  quantities  (cf.  (20) / (21)),  which  affects  how 
easily  the  resulting  conditional  expectation  can  be  estimated  from  sample  paths.  There  are  also  a  multitude 
of  choices  of  WD  triples  for  a  given  input  distribution,  and  this  determines  both  the  amount  of  additional 
simulation  required  and  the  variance  of  the  resulting  WD  output  gradient  estimator.  For  LR/SF  estimators, 
the  variance  of  the  estimator  could  also  be  a  problem  if  care  is  not  taken  in  implementation,  e.g.,  the  naive 
estimator  will  lead  to  a  linear  increase  in  variance  with  respect  to  the  simulation  horizon. 

4.2  Derivatives  of  Random  Variables 

PA  estimators  —  e.g.,  those  shown  in  (15),  (21),  (22)  -  require  the  notion  of  derivatives  of  random  variables. 
The  mathematical  problem  for  defining  such  derivatives  consists  of  constructing  a  family  of  random  variables 
parameterized  by  9  on  a  common  probability  space,  with  the  point  of  departure  being  a  set  of  parameterized 
distribution  functions  {.F(-;0)}.  We  wish  to  construct  X{9)  ~  F(-;9)  s.t.  V0  £  0,  X(9 )  is  differentiable 
w.p.l.  The  sample  derivative  is  then  defined  in  the  intuitive  manner  as 


dX(9,to)  X(9 

- - —  =  hm  - 

d9  A#— >o 


A 6,  to)  —  X(9,  to) 

A 9  ' 


where  to  denotes  a  sample  point  in  the  underlying  probability  space.  If  the  distribution  of  X  is  known,  we 
have 

dX(9)  _  dF(X-9)/d9 


d9 


dF{X-,9)/dX ’ 
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where  we  use  the  (slightly  abusive)  notation 


x=X 


dF(X;  8)  dF{x-,  8) 
dX  "  dx 

Definition.  For  a  distribution  function  F(x\0 ),  6  is  said  to  be  a  location  parameter  if  F(x  +  9;  8)  does 
not  depend  on  8\  9  is  said  to  be  a  scale  parameter  if  F(x9;  8)  does  not  depend  on  9\  and  8  is  said  to 

be  a  generalized  scale  parameter  if  F{9  +  x9\  8)  does  not  depend  on  9,  for  some  fixed  8  (usually  a  location 
parameter)  not  dependent  on  9. 

In  these  special  cases,  one  can  use  the  following  sample  derivatives  for  the  three  respective  cases  (location, 
scale,  generalized  scale): 

dX  dX  _  X  dX  _  X  -9 

~dF~  ’  ~dB~T'  ~dB  ~  8 

The  most  well-known  example  is  the  normal  distribution,  with  the  mean  being  a  location  parameter  and 
the  standard  deviation  a  generalized  scale  parameter.  Similarly,  the  two  parameters  in  the  Cauchy,  Gumbel 
(extreme  value),  and  logistic  distributions  also  correspond  to  location  and  generalized  scale  parameters. 
Other  examples  include  the  mean  of  the  exponential  being  a  scale  parameter;  and  the  mean  of  the  uniform 
distribution  being  a  location  parameter,  with  the  half-width  being  a  generalized  scale  parameter.  In  the 
special  case  U(0,9),  the  single  parameter  is  an  ordinary  scale  parameter.  Also,  for  N(9,  (8a)2),  8  is  an 
ordinary  scale  parameter. 

Lastly,  note  that  for  a  given  distribution,  there  may  be  multiple  ways  to  generate  a  random  variate,  which 
leads  to  different  derivatives,  some  of  which  may  be  unbiased  and  some  of  which  may  not.  This  is  called  the 
role  of  representations,  which  we  illustrate  with  a  simple  example. 


Example  1  Let 


/  17(1,2)  w.p.  0,  9  g  (0,1), 

\  17(0,1)  w.p.  1  —  9, 


a  mixture  of  two  uniform  distributions,  with  dE[X]/dd  =  1.  A  straightforward  construction/representation 
using  two  random  numbers  is 


A  =  l{t/i  <  9}{  1  +  U2)  +  1{U\  >  8}U2, 


(24) 


where  U\,  U2  ~  U{ 0, 1)  are  independent  and  1  { - }  denotes  the  indicator  function.  However,  since 


dX 

~d8 


0  w.p.l, 


this  clearly  leads  to  a  biased  estimator.  Note  that  viewed  as  a  function  of  8,  X  jumps  from  U2  to  1  +  U2  at 
9  =  U\.  However,  an  unbiased  estimator  can  be  obtained  by  using  the  following  construction  in  which  the 
“coin  flipping”  and  uniform  generation  are  correlated: 


A  =  1  {U  <  9}  ^1+  9  QU^j  +1{U  >  8}  ’  where  U  ~  C/(°’1)’ 

which  is  unbiased  (has  expectation  equal  to  dE[X\/dd  =  1).  This  construction  is  based  on  the  property 
that  the  distributions  of  the  random  variable  ( 8  —  U)/8  under  the  condition  {U  <  6}  and  the  random 
variable  (1  —  U)/(  1  —  8)  under  the  condition  {U  >  8}  are  both  f7(0,l).  From  a  simulation  perspective, 
this  representation  has  the  additional  advantage  of  requiring  only  a  single  random  number  to  generate  A 
instead  of  two  as  in  the  previous  construction.  In  terms  of  the  derivative,  the  crucial  property  is  that  A  is 
continuous  across  8  =  U.  One  can  easily  construct  other  single  random  number  representations  that  do  not 
have  this  desirable  characteristic,  e.g., 


A  =  1  {U<  9}(  1  +  j)  +  1  {U  >  8}lj-^ 


where  U  ~  t/(0, 1), 


dX 

~d8 


l{U<8}-^-  +  l{U>8} 


1  -  U  ' 

(1-0)2J  ’ 
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which  is  biased  (has  expectation  -1),  the  intuitive  reason  being  the  discontinuity  of  X  at  U  =  9. 

For  the  first  representation,  where  the  “natural”  construction  leads  to  a  biased  estimator,  we  demon¬ 
strate  the  use  of  conditioning  (SPA)  to  rectify  the  situation.  First,  we  derive  an  alternative  estimator  by 
conditioning  on  U2,  i.e. ,  let 


X  =  E[X1  \U2]  =  (1  +  U2)9  +  U2{  1  -O)  =  0+U2, 

leading  to  the  obviously  unbiased  estimator  dX/d6  =  1. 

Alternatively,  consider  the  event 


B{A9)  =  {C/i  G  (9,0  +  A0]},  A9  >  0. 


Intuitively,  this  event  corresponds  to  those  samples  such  that  a  perturbation  A 9  >  0  causes  a  “jump”  in  the 
sample  performance  from  U2  (when  U±  >  9)  to  1  +  U2  (when  U\  <  9  +  A O').  The  complement  of  this  event 
corresponds  to  the  “smooth”  case  of  IPA.  Thus,  we  write  for  the  representation  defined  by  (24): 


dE[X] 

d.9 


lim 

AO^O 

lim 

A8^0 


E[X(0  +  A0)  -  X(6)} 

A  9 

E[(X(9  +  A 9)  -  X(6»))1(SC(A6»))] 


E 


lim 
A0—  0 

dX' 


d9 


A9 

E[(X(0  +  A 9)  -  X(0))1(B(A0))] 

A  9 

lim  E[X(9  +  A9)  -  X(9)\B(A9)]  lim 
A8^0  L  A8^0  A  9 

A  9 


lim  (2£[(1  +  U2)  -  U2\)  lim  —  =  1, 
a#— >0 v  Ae^o  A9 


(25) 


where  Bc  denotes  the  complement  of  B.  In  this  case,  because  we  can  evaluate  the  difference  (1  +  U2)  —  U2 
analytically,  the  final  “estimator”  is  a  deterministic  quantity  equal  to  the  value  to  be  estimated.  In  more 
complicated  systems,  the  challenge  is  usually  in  estimating  this  difference  of  the  sample  performance  under 
two  different  conditions  on  the  sample  path,  given  by  the  limiting  conditional  expectation  in  (25). 


4.3  Derivatives  of  Measures 

As  we  have  seen  already,  both  the  LR/SF  and  WD  estimators  rely  on  differentiation  of  the  underlying 
measure,  so  the  parameters  of  interest  should  appear  in  the  underlying  (input)  distributions.  If  this  is  not 
the  case,  then  one  approach  is  to  try  to  “push”  the  parameter  out  of  the  performance  measure  into  the 
distribution,  so  that  the  usual  differentiation  can  be  carried  out.  This  is  achieved  by  a  change  of  variables, 
which  is  problem  dependent. 

Recall  that  we  introduced  the  idea  of  a  weak  derivative  by  expressing  the  derivative  of  a  density  as  the 
difference  of  two  measures,  i.e., 


^^l  =  c(9)  (f<-2Hx-,0)-fW(x-,0)). 

This  idea  can  be  generalized  without  the  need  for  a  differentiable  density,  as  long  as  the  integral  exists  with 
respect  to  a  certain  set  of  (integrable)  “test”  functions,  say  £,  e.g.,  the  set  of  continuous  bounded  functions. 

Definition.  The  triple  (c(0),  F^,  F^)  is  called  a  weak  derivative  for  distribution  (c.cl.f.)  F  if  for  all 
functions  g  G  C  (not  a  function  of  9), 

J  g(x)dF(x-,9)  =c(9)  (/  g(x)dF{-2\x\9)  -  J  g{x)dF{1)  (x\9) 

Remarks.  As  mentioned  earlier,  the  derivative  is  “weak”  in  the  sense  that  the  density  derivative  may  not 
be  defined  in  the  usual  sense,  but  in  terms  of  generalized  functions  integrable  with  respect  to  the  functions 
in  £ ,  as  in  the  “definition”  of  a  delta  function  in  terms  of  its  integral,  i.e.,  they  could  include  discrete  mass 
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functions,  as  well.  The  concept  of  a  weak  derivative  need  not  be  restricted  to  probability  measures,  but 
any  finite  signed  measures.  Lastly,  note  that  a  WD  gradient  estimate  may  require  as  many  as  2d  additional 
simulations  for  the  vector  case  (a  pair  for  each  component),  unlike  LR/SF  and  IPA  estimators,  which  will 
always  require  just  a  single  simulation. 

One  choice  for  the  weak  derivative  (density)  that  is  readily  available  is 


df  , ,  f  , 

-e=c(h-h), 


(26) 


where 


a-HS 


a-KI 


(27) 


x+  =  max(x,0),a;  =  max(— x,  0),  and  c  =  f  dx  =  f  f2dx  =  f  dx ,  using  the  fact  that 

df 


Sf{x)ix= i  =*■  / 


dJ9 


dx  =  0. 


The  representation  given  by  (26)  and  (27)  is  the  Hahn-Jordan  decomposition,  which  will  always  exist  for 
probability  measures,  and  results  in  a  decomposition  involving  two  measures  with  complementary  support. 
Remarks.  The  representation  is  clearly  not  unique.  In  fact,  for  any  non-negative  integrable  function  h,  we 
have 

=  c([/i  +  h]  -  [f 2  +  h ])  =  c'  ^[/i  +  h\/(  1  +  J  h)  -  [/2  +  h\/(  1  +  J  h)^j  , 

where  c7  =  c(l  +  f  h).  Thus,  one  way  to  obtain  the  estimator  using  the  original  simulation  is  to  choose  a 
representation  in  which  both  measures  have  the  same  support  as  the  original  measure.  Then  importance 
sampling  can  be  applied,  so  that  the  original  simulation  can  be  used  to  generate  the  estimator  without  the 
need  for  simulating  the  system  under  alternative  input  distributions.  Perhaps  the  most  direct  way  to  achieve 
this  is  to  add  the  original  measure  itself  to  both  /i  and  /2  and  renormalize  appropriately,  i.e.,  choose  h  =  f 
above: 

|  =  2c([/1  +  /]/2-[/2  +  /]/2). 


4.4  Input  Distribution  Examples 

We  now  demonstrate  some  of  these  concepts  on  a  single  random  variable.  Section  5  will  consider  the  examples 
introduced  at  the  beginning. 

Example  2  Let  X  ~  exp (0),  an  exponential  random  variable  with  mean  0,  with  p.d.f.  given  by 

f(x;0)  =  ^e-x/el{x>O}. 

The  usual  construction  of  the  r.v.  is  as  follows: 


X(d;  u)  =  — 01n  u, 

where  u  represents  a  random  number.  Differentiating,  we  get 


df(x-  9) 
dd 


dX{9\  ii) 
d9 


x  1 

¥  e( 


=  f(x-,6) 


1 


1 

“  w e 


l{x  >  0} 


x  1 
¥  ~  6 


e  l¥e  X/Vlix>°}- 

1  r  e  /  x 


rM( I  -  0  '-"T"  >  *>  -  K1  -  f)  <^ei 

X(0;  u) 


=  —  In  //  = 
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In  the  third  and  fourth  lines  above,  the  density  derivative  (which  is  itself  not  a  density)  has  been  expressed 
as  the  difference  of  two  densities  multiplied  by  a  constant.  This  demonstrates  that  the  weak  derivative 
representation  is  not  unique,  with  the  last  decomposition  being  the  Hahn-Jordan  decomposition,  noting  that 
x  =  9  is  the  point  at  which  df(x:  0)/d6  changes  sign.  The  following  correspond  to  the  LR/SF,  WD  (a)  & 
(b),  and  IPA  estimators,  respectively: 


Y(X^,...)-Y(X^,...) 


1 

We  L 
dY  Xx 

dxWlT 


Y(X{2) 


lb  ’ 


...)-y(x«,...)' 


where  X^J  and  X[2J  are  random  variables  distributed  as  exp(9 )  and  Erl(2,9),  respectively,  and  X^  and 
are  random  variables  with  densities  |  (l  —  | )  e~x^e ,  0  <  x  <9,  and  |  (|  —  l)  e~x^e ,  x  >  9,  respectively. 


The  following  is  a  simple  example  that  demonstrates  that  the  WD  estimator  is  more  broadly  applicable 
than  the  LR/SF  estimator. 

Example  3  Let  X  ~  17(0,6*).  Then  we  have  the  following: 


f(x;9) 

=  ^1{0  <  x  <  9}, 

X(9;  u) 
df(x;  9) 
d9 

=  u9 

1 

9 

5(9  —  x)  —  ^-1{0  <  x  <  9} 

U 

=  [<5(6>  -  x)  -  f(x-,  6*)] , 

dX(9;  u) 
d9 

=  u 

X(9;  u) 

9  ’ 

where  we  define  the  Dirac  (5-function  as  the  “derivative”  of  a  step  function  by 


(28) 


l{x  >  9} 


[  5(y  -  9)dy. 

J  —  OO 


(29) 


On  the  right-hand  side  of  Equation  (28),  we  have  the  difference  of  densities  for  a  mass  at  9  and  the  original 
17(0,6*)  distribution,  respectively,  i.e.,  the  weak  derivative  representation  (1/9,9,  F),  where  6*  indicates  a 
deterministic  distribution  with  mass  at  9.  So,  for  example,  the  estimator  in  (19)  would  be  given  by 

-e(Y(9,X2,...)-Y(XuX2, ...)). 


This  is  a  case  where  the  LR/SF  estimator  is  ill-defined,  due  to  the  (5-function.  Another  example  is  the 
following. 

Example  4  Let  X  ~  Par  (a,  9),  which  represents  the  Pareto  distribution  with  shape  parameter  a  and  scale 
parameter  9,  and  p.d.f.  given  by 

f(x)  =6*“aa;-(“+1)l{:r  >6*}. 

Once  again  the  LR/SF  estimator  does  not  exist,  due  to  the  appearance  of  the  parameter  in  the  indicator 
function  that  controls  the  support  of  the  distribution,  whereas  WD  estimators  can  be  derived  (see  Table  1 
at  the  end  of  the  section) . 

However,  the  very  general  exponential  family  of  distributions  leads  to  a  nice  form  for  the  LR/SF  estimator. 
Example  5  Let  9  denote  the  vector  of  parameters  in  a  p.d.f.  that  can  be  written  in  the  following  form: 

f(x;9)  =  k(9)  exp  Vi(9)ti(x)\  h(x), 
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where  the  functions  h  and  {L}  are  independent  of  9 ,  and  the  functions  k  and  {14}  do  not  involve  the 
argument.  Then  it  is  straightforward  to  derive 


d  lnf(x;  9) 

89 


Vfc(0) 

m 


+  y ^yvi{8)u{x). 

i 


Examples  include  the  normal,  gamma,  Weibull,  and  exponential,  for  the  continuous  case,  and  the  binomial, 
Poisson,  and  geometric  for  the  discrete  case. 

Discrete  distributions  present  separate  challenges  for  the  two  approaches.  Basically,  when  the  parameter 
appears  in  the  support  probabilities,  then  LR/SF  can  be  easily  applied,  whereas  IPA  is  in  general  not 
applicable.  The  reverse  is  true,  however,  if  the  parameter  appears  instead  in  the  support  values.  Here  are  two 
examples  to  demonstrate  this,  where  we  work  directly  with  probability  mass  functions  p{x;  8)  =  P{X  =  a;), 
instead  of  densities  with  ^-functions.  Let  Ber(p;  a,  b)  denote  a  Bernoulli  distribution  that  takes  value  a  with 
probability  p  and  b  with  probability  1  —  p. 


Example  6  Let  X  ~  Ber(8;  a,  b),  which  has  mass  function 


p(x;  8)  =  8  ■  l{x  =  a}  +  (1  —  8)  ■  l{x  =  b}, 


so 


_  =  l{a :  =  a}  -  l{x  =  b}, 

which  can  be  viewed  as  the  difference  of  two  (deterministic)  masses  at  a  and  b  (with  c(8)  =  1),  and  is  the 
Hahn- Jordan  decomposition  in  this  case.  For  the  LR/SF  estimator,  we  have 

dlnp  l{x  =  a}  -  l{x  =  bj  1  _ _ 1 

88  81{x  =  a}  +  (1  ~  9)l{x  =  bj  0^X  1  -  9^X 

In  this  case,  there  is  no  way  to  construct  X  such  that  it  will  be  differentiable  w.p.l.  For  example,  the  natural 
construction/representation 


X  =  al{U  <  9}  +  bl{U  >  9} 
yields  dX/d8  =  0  w.p.l,  so  IPA  is  not  applicable. 

Example  7  Let  X  ~  Ber(p ;  9 ;  0),  which  has  mass  function 

p(x;  9)  =  p-  l{x  =  9}  +  (1  —  p)  ■  l{x  =  0}, 

which  is  not  differentiable  with  respect  to  9,  so  LR/SF  and  WD  estimators  cannot  be  derived. 
The  natural  random  variable  construction 


X  =  0  ■  1{U  <  p} 


leads  to  the  unbiased 


dX 

~d9 


1  {U  <p}  =  1{X  =  8} 


(which  in  this  example  also  happens  to  equal  X/8).  The  IPA  estimator  dX/dd  =  1{A  =  8}  holds  even  if 
any  number  of  additional  values  are  added  to  the  underlying  support,  if  all  of  them  do  not  involve  9.  If  9 
enters  into  them,  then  the  estimator  can  be  easily  modified  to  reflect  that. 

It  is  straightforward  to  verify  the  corresponding  derivatives  of  some  common  distributions  displayed  in 
Table  1,  where  geo{p)  indicates  a  geometric  distribution  with  mass  function 


(1  -P)n  1P,  *  =  1,2,...; 


bin(n,p)  indicates  a  binomial  distribution  with  mass  function 


Px(i-Py 


x  =  0,1, 


n 

P 
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negbin(n,p)  indicates  a  negative  binomial  distribution  with  mass  function 


x  —  1 
n  —  1 


(1  ~p)x  V\  #  =  n,  n  +  1,  •  •  • ; 


Poi(X)  indicates  a  Poisson  distribution  (mean/ variance  A)  with  mass  function 

e_AA"  „  _ n  i 

i  ■>  %  0)  i ?  •  •  •  ? 

x\ 

Erl(n,f3)  indicates  an  Erlang  distribution  with  p.d.f. 

o-n^n- 1  -x/j3 

P  *  e  1{£>0}; 

(«-!)! 

Mxw(fi,  a2)  indicates  a  double-sided  Maxwell  distribution  with  p.d.f. 

(x  —  p)2  (*-<u2 


•\/27ri 


7T(7J 


Wei(a,/3)  indicates  a  Weibull  distribution  with  shape  parameter  a  and  scale  parameter  (3,  and  p.d.f. 

ap-axa-le-{x/l3rl{x  >  0}; 

gam(a,  f3 )  indicates  a  gamma  distribution  with  shape  parameter  a  and  scale  parameter  f3,  and  p.d.f. 

/3~axa~1e~x//3 


P(a) 


-l{x  >  0}, 


where 


noo 

r(a)  =  /  ta^1e~tdt  =  (a  —  l)r(a  —  1); 

Jo 


recalling  that  Erl(n, /3)  =  gam(n,P)  for  positive  integer  n;  Wei(  1,/d)  =  gam(l.  0)  =  exp(/3)-  and  the  special 
distribution  F*(a,  9)  has  p.d.f.  (does  this  fall  into  any  of  the  well-known  families?) 

a/r2a£2a-1e“(x//3)“. 

Note  that  our  definition  of  the  second  parameter  f3  in  the  gamma  and  Weibull  distributions  (also  corre¬ 
sponding  to  the  only  parameter  in  the  exponential  distribution)  is  the  inverse  of  what  is  usually  found  in  the 
literature  (Law  and  Kelton  2000  being  a  notable  exception),  but  leads  to  it  being  a  scale  parameter  under 
our  definition. 


5  Examples 

5.1  Stochastic  Activity  Network 

Let  Xi  have  p.d.f.  fi,  and  assume  all  of  the  activity  times  are  independent.  Let  P*  denote  the  set  of  activities 
on  the  optimal  path  corresponding  to  the  project  duration  (e.g.,  shortest  or  longest  path,  depending  on  the 
problem),  so  we  can  write 

Y= 

jeP* 

where  P*  itself  is  a  random  variable.  We  wish  to  estimate  dE[Y]/d0.  Let  9  be  a  parameter  in  the  distribution 
of  Xi,  i.e. ,  in  /i  only.  Then,  the  IPA  estimator  is  given  by 


dY 

d9 


dX  i 
~df 


1{1  G  P*}. 
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input  dist 

WD 

IPA 

LR/SF 

A  ~  F 

(c,F(2),F«) 

dX 

_ _ 

dlnf(X) 

_ d# _ 

Ber{9\  a ,  b) 

(1  ,a,b) 

NA  U{X  =  a] 

- 

j^l{X  =  b} 

Ber{p\  6,  b) 

NA 

HX  =  9} 

NA 

geo(6) 

{jj,geo(9),negbin(2,9)) 

NA 

1  1  1-X 
e  ^  i-e 

bin(n ,  9) 

(n,  1  +  bin(n  —  1,9),  bin(n  —  1,9)) 

NA 

X  n-X 

8  1-8 

Poi(9) 

(1,1  +  Poi(9),Poi(9)) 

NA 

X  _  1 

8  1 

N(9,a2) 

(7^,0  +  Wei^  We^  2^)) 

1 

X-8 

a2 

N  (/i,  92) 

(l,Mxw(g,92),N(y,92)) 

X-n  1 

6  e 

(Y)  - 1 

U(0,9) 

(b9,U(  0,9)) 

X 

6 

NA 

U(6  —  7, 9  +  7) 

(^>#  +  7.0-7) 

1 

NA 

U(y  —  9,  g  +  9) 

(^,Ber( 0.5;  g  -  9,  g  +  9),  U{g  -  9,  g  +  9)) 

X-n 

G 

NA 

exp(9) 

( \,Erl(2,6),exp{9 )) 

X 

e 

Wei(a,  9) 

(f  ,F*(a,9),Wei(a,9)) 

x  1 

e  e 

(*)“-« 

gam{a,  9) 

(j,gam(a  +  1, 9),gam(a,  9)) 

X 

e 

m-«) 

Par(a ,  9) 

{%,Par{a,9),9) 

X 

_ e _ 

NA 

Table  1:  Derivatives  for  some  common/simple  input  distributions  (“NA”  =  not  applicable). 


The  LR/SF  estimator  is  given  by 


The  WD  estimator  is  of  the  form 


y  din  \  0) 

dQ 


c{0)  (y(x[2),X2,  . . . ,  XT)  -  r(xf\  X2, . . . ,  Xr)) 

where  x[^  ~  F^\j  =  1,2,  and  {c(9),  F^\  F^)  is  a  weak  derivative  for  F\, 

If  we  allow  the  parameter  to  possibly  appear  in  all  of  the  distributions,  then  the  IPA  estimator  is  found 
by  applying  the  chain  rule: 


(FY  \  ^  dXi 
~dd  =  ^  ~df  ] 

ieP* 


whereas  the  LR/SF  and  WD  estimators  are  derived  by  applying  the  product  rule  of  differentiation  to  the 
underlying  input  distribution,  applying  the  independence  that  allows  the  joint  distribution  to  be  expressed 
as  a  product  of  marginals.  In  particular,  the  LR/SF  estimator  is  given  by 


Y{X) 


E 


a  in  fi(Xi-,e) 

de 


The  WD  estimator  is  of  the  form 


T 

EciW  (E*1,  •  •  •  >E2)- •  •  •  ,Xt)  -  Y(X r, . . .  ,xp, 

i=  1 


where  xj^  ~  F^ ,  j  =  1,  2;  i  =  1, . . . ,  T,  and  (cj(0),  F f2-1 ,  F. j1'1)  is  a  weak  derivative  for  Fj. 

If  instead  we  were  interested  in  estimating  P(Y  >  y)  for  some  fixed  y,  the  WD  and  LR/SF  estimators 
would  simply  replace  Y  with  the  indicator  function  1{P  >  y}.  However,  IPA  will  not  apply,  since  the 
indicator  function  is  inherently  discontinuous,  so  an  extension  of  IPA  such  as  SPA  is  required.  On  the 
other  hand,  if  the  performance  measure  were  P(Y  >  9),  then  since  the  parameter  does  not  appear  in  the 
distribution  of  the  input  random  variables,  WD  and  LR/SF  estimators  cannot  be  derived  without  first 
carrying  out  an  appropriate  change  of  variables. 
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To  derive  a  PA  estimator  for  the  derivative  of  P(Y  >  y),  we  first  define  the  following: 

Vj  =  {P  G  V  |  j  G  P}  =  set  of  paths  containing  arc  j, 
\P\~j  =  length  of  path  P  with  X3  =0. 


The  idea  will  be  to  condition  on  all  activity  times  except  a  set  that  includes  activity  times  dependent  on  the 
parameter.  In  order  to  proceed,  we  need  to  specify  the  form  of  Y .  We  will  take  it  to  be  the  longest  path. 
Other  forms,  such  as  shortest  path,  are  analogously  handled.  Again,  assuming  that  9  occurs  in  the  density 
of  Xi  and  taking  the  conditioning  quantities  to  be  everything  except  X±,  i.e.,  Z  =  { X2 , . . . ,  At},  we  have 


Lz(9)  =  Pz(Y  >y)  =  E[1{Y  >  y}\X2, . . . ,  XT] 


1  ifmaxpiG-p  |Pj|_i  >  y, 

Pz  (ma xpiG-p1  |Pj|  >  y)  otherwise; 


where  Pz  denotes  the  conditional  (on  Z)  probability.  Since 


Pz{  max  \Pi\  >y)  =  Pz{X 1  +  max  |Pj|_i  >  y) 


Pz(X  1  >  y  -  max  |P»|_i) 


Fi(y~  mjK  |P»|-i), 


Differentiating, 


Lz(0)  =  Fi(y 


max  |Pj|_i)  •  1  { max  |Pj|_i  <  y}  +  l{max  |P|_i 
PitVi  PiGV  PiGV 


>  y}- 


dLz 

~df 


max  |Pj|_i)  •  1  { max  |P»|_i  <  y}. 
PiGVi  Pi£V 


For  the  shortest  path  problem,  simply  replace  “max”  with  “min”  throughout. 

To  derive  an  LR/SF  or  WD  derivative  estimator  for  the  performance  measure  P(Y  >  9 ),  there  are  two 
main  ways  to  do  the  change  of  variables:  subtraction  or  division. 


P(Y  —  9  >  0),P(Y/9  >  1). 


To  achieve  this  requires  translating  the  operation  on  the  output  performance  measure  back  to  a  change  of 
variables  on  the  input  random  variables,  so  this  clearly  requires  some  additional  knowledge  of  the  system 
under  consideration.  In  this  particular  case,  it  turns  out  that  two  properties  make  it  amenable  to  a  change  of 
variables:  (i)  additive  performance  measure;  (ii)  clear  characterization  of  paths  that  need  to  be  considered. 
The  simplest  change  of  variables  is  to  take 

Xi  =  Xi/0  Vi  G  A, 

so  that  9  now  appears  as  a  scale  parameter  in  each  distribution  /j.  If  Y  represents  the  performance  measure 
after  the  change  of  variables,  then  we  have 

P(Y  >9)  =  P(Y  >  1), 


and  this  can  be  handled  as  previously  discussed. 

Another  change  of  variables  that  will  work  in  this  case  is  to  subtract  the  quantity  9  from  an  appropriate 
set  of  arc  lengths.  In  particular,  the  easiest  sets  are  the  nodes  leading  out  of  the  source  or  the  nodes  leading 
into  the  sink.  Note  that  minimal  cut  sets  will  not  necessarily  do  the  trick.  Again,  if  Y  represents  the 
performance  measure  after  the  change  of  variables,  then  we  have 

P(Y  >9)  =  P{Y  >  0). 


Now  the  parameter  9  appears  in  the  distribution,  specifically  as  a  location  parameter,  but  only  in  a  relatively 
small  subset  of  the  {/j}.  Since  this  transformation  results  in  the  parameter  appearing  in  fewer  number  of 
input  random  variables,  it  may  be  preferred,  because  for  both  the  LR/SF  and  WD  estimators,  the  amount 
of  effort  is  proportional  to  the  number  of  times  the  parameter  appears  in  the  distributions.  The  extra  work 
for  a  large  network  can  be  particularly  burdensome  for  the  WD  estimator.  However,  for  the  LR  estimator, 
this  type  of  location  parameter  is  problematic,  since  it  changes  the  support  of  the  input  random  variable. 

Application  of  PA  to  the  problem  of  estimating  dP{Y  >  9)/d9  also  requires  SPA,  the  idea  being  to 
condition  on  all  activity  times  except  a  special  (minimal)  set  that  includes  at  least  one  activity  that  must  be 
on  the  optimal  (critical)  path.  Taking  any  minimal  cut  set,  one  can  derive  an  unbiased  derivative  estimator 
for  P(Y  >9)  by  conditioning  on  the  complement  set  and  then  taking  the  sample  path  derivative.  The  details 
are  contained  in  Fu  (2005). 
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Figure  1:  Quantities  used  in  deriving  FCFS  single-server  queue  SPA  estimator. 


5.2  Single-Server  Queue 

We  illustrate  all  of  the  gradient  estimators  for  the  queueing  example.  Let  An  be  the  interarrival  time  between 
the  (n  —  l)st  and  nth  customer  (i.i.d.  with  PDF  /i  and  CDF  Fi),  Xn  the  service  time  of  the  nth  customer 
(i.i.cl.  with  PDF  fi  and  CDF  F2),  and  Tn  the  system  time  (in  queue  plus  in  service)  of  the  nth  customer. 
We  consider  the  case  where  6  is  a  parameter  in  the  service  time  distribution,  and  the  sample  performance  of 
interest  is  the  average  system  time  over  the  first  N  customers  T n  =  Yln=  1  Assume  that  the  system 
starts  empty,  so  that  the  times  of  the  first  N  service  completions  are  completely  determined  by  the  first  N 
interarrival  times  and  first  N  service  times. 

The  system  time  of  a  customer  for  a  FCFS  single-server  queue  satisfies  the  well-known  recursive  Lindley 
equation: 

Tn+ 1  =  Xn+i  +  (Tn  —  An+i)+ .  (30) 

The  IPA  estimator  is  obtained  by  differentiating  (30): 


dTn+ 1 
dB 


dXn. |_i 
dO 


+  >  An+ 1}. 


Using  the  above  recursion,  the  IPA  estimator  for  the  derivative  of  average  system  time  is  given  by 


(31) 


dTN 

d.e 


n= 1 


dTn 

de 


^  m  ^  i  dXj 

/V  dO 

m= 1  i—rim- i  +  l  J— nm_i  +  l 


(32) 


where  M  is  the  number  of  busy  periods  observed  and  nm  is  the  index  of  the  last  customer  served  in  the 
mth  busy  period  (no  =  0  and  tim  =  X  for  M  complete  busy  periods).  Implementation  of  the  estimator 
involves  keeping  track  of  two  running  quantities,  one  for  (31)  and  another  for  the  summation  in  (32);  thus, 
the  additional  computational  overhead  is  minimal,  and  no  alteration  of  the  underlying  simulation  is  required. 

The  implicit  assumption  used  in  deriving  an  IPA  estimator  is  that  small  changes  in  the  parameter  will 
result  in  small  changes  in  the  sample  performance.  Thus,  in  the  above,  this  means  that  the  boundary 
condition  in  (31)  is  unchanged  by  differentiation.  In  general,  the  interchange  (12)  will  typically  hold  if  the 
sample  performance  is  continuous  with  respect  to  the  parameter.  For  the  Lindley  equation,  although  Tn+ 1 
in  (30)  has  a  “kink”  at  Tn  =  An+ 1,  it  is  still  continuous  at  that  point.  This  intuitively  explains  why  IPA 
works.  Unfortunately,  the  “kink”  means  that  the  derivative  given  by  (31)  has  a  discontinuity  at  Tn  =  An+ 1, 
so  that  IPA  will  fail  for  the  second  derivative. 

An  unbiased  SPA  second  derivative  estimator  can  be  derived  by  conditioning  on  all  previous  interarrival 
and  service  times  at  each  departure,  which  determines  the  system  time,  say  Tn,  with  the  corresponding 
next  interarrival  time,  An+ 1,  unconditioned.  We  provide  a  brief  informal  derivation  based  on  sample  path 
intuition  (refer  to  Figure  1).  For  the  right-hand  estimator,  in  which  we  assume  A Tn  >  0  (technically  it 
should  refer  to  Ad),  the  only  “critical”  events  are  those  departures  that  terminate  a  busy  period,  with  the 
possibility  that  two  busy  periods  coalesce  (idle  period  disappears)  due  to  a  perturbation.  The  corresponding 
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probability  rate  (conditional  on  Tn)  is  then  calculated  as  follows: 

P(Tn  +  ATn  >  An+i\Tn  <  An+i)  _  fi{Tn)  dTn 
a™o  A Yn  “  1  —  Fi(Tn)  ~dF  ’ 

and  the  corresponding  effect  would  be  that  the  A Tn  perturbation  would  be  propagated  to  the  next  busy 
period.  The  complete  SPA  estimator  is  given  by 

fd2TN\  1  y,  ^  A  d2X3 

V  d62  )  qpA  ~  N  ^  ^  ^  dd 2 

,  1  y  h(Tnm)  ( dTn\ 2 

where  d2X/dd 2  is  well-defined  when  F2(X;9)  is  twice  differentiable,  4^5-  =  0  for  location,  scale,  and  gener¬ 
alized  scale  parameters. 

To  derive  the  LR/SF  estimator,  since  all  the  interarrival  and  service  times  are  independently  generated, 
the  joint  p.d.f.  /  on  X  will  simply  be  the  product  of  the  p.cl.f.s  of  the  interarrival  and  service  time  distributions 
given  by 


M  An,  X,, ...,  XN)  =  []  hiAi)  f2(Xt- 9). 


Thus,  the  straightforward  estimator  would  be  given  by 

( dfN\  1  y^dln  f2(Xj-9) 

V  d9  )TR  89  ’  ^  ] 

where  expressions  for  some  common  input  distributions  can  be  found  in  Table  1.  However,  for  large  N,  the 
variance  of  the  estimator,  which  increases  linearly  with  N,  will  render  it  practically  useless,  so  some  sort  of 
truncation  is  necessary,  and  in  this  particular  example,  the  regenerative  structure  provides  such  a  mechanism. 
Using  regenerative  theory,  the  mean  steady-state  system  time  can  be  written  as  a  ratio  of  expectations: 

MT1  =  MM 

[  1  m  ’ 

where  77  is  the  number  of  customers  served  in  a  busy  period  and  Q  is  the  sum  of  the  system  times  of  customers 
served  in  a  busy  period.  Differentiation  yields 

dE[T\  =  dE[Q}/d9  _  dE[r,]/d9 
d9  E[V\  E[rj]  1 

Now,  use  the  natural  LR/SF  estimators  for  each  of  the  terms  separately  to  obtain  the  following  regenerative 
estimator  over  M  busy  periods: 


dJ-N  \  _  \  '  I  t  ’ST 

99  ~  ~  n  2^  \  2^  i  2^ 


m—1  I  i—n-m—  i  +  l  £=nm_i  +  l 


8\n  f2{Xi-,9) 
89 


3 In h(Xl',9)  I  j_  A 
89 


i=nm_  i  +  l 


The  advantage  of  these  estimators  is  that  the  summations  are  bounded  by  the  length  of  the  busy  periods, 
so  as  long  as  the  busy  periods  are  not  too  lengthy,  the  variance  of  the  estimators  should  be  acceptable. 
Furthermore,  the  second  derivative  estimator  is  relatively  easy  to  derive,  as  well: 


1  M  I  nm  rim 

=  sE  S  £ 


m—1  I  i=l  i=rim- i+l 


92ln/2(Ai;0)  ,  f  8ln  f2(Xi- 9) 


vE  j  (nm  nm-l) 


i=nm_  i  +  l 


82lnf2(Xf,9)  ,  f8lnf2(Xi;9)\2  I  1 
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The  WD  estimator  is  also  relatively  straightforward,  just  incorporating  the  product  rule  of  differentiation 
as  before: 

(^f)  =c(0)'52\fN(A1„..,AN,...,xP,...)-TN(A1,...,AN,..,txP,...) }  , 

\  aU  /  WD  l=1  1  J 

where  X-^  ~  F^\j  =  1,2;*  =  for  (c(6),  F^\  F^)  a  weak  derivative  of  F2.  A  second  derivative 

estimator  would  take  exactly  the  same  form,  the  only  difference  being  that  (c(0),  F^\  F^)  should  be  a 
weak  second  derivative  of  F^.  Note  that  in  general,  implementation  of  the  estimator  requires  2 T  separate 
sample  paths  and  resulting  sample  performance  estimates,  because  the  parameter  appears  in  T  input  random 
variables.  Although  the  variance  of  the  estimator  does  not  increase  with  T,  implementation  may  not  be 
practical  for  large  T.  However,  in  many  cases,  the  expression  can  be  simplified,  making  the  computation  more 
acceptable.  The  variance  properties  of  the  estimators  depend  heavily  on  the  particular  weak  derivative(s) 
used. 

What  happens  when  T  is  random  and  when  T  — >  00  is  also  of  theoretical  interest.  In  these  settings,  the 
estimators  are  extended  in  the  natural  way,  but  proving  theoretical  properties  becomes  a  more  challenging 
task. 


5.3  (s,  S )  Inventory  System 

We  now  consider  the  single-item  periodic  review  (s,  S)  inventory  system,  in  which  once  every  period  the 
inventory  level  is  reviewed  and,  if  necessary,  orders  are  placed  to  replenish  depleted  inventory.  An  (s,  S) 
ordering  policy  specifies  that  an  order  be  placed  when  the  level  of  inventory  on  hand  plus  that  on  order 
(called  inventory  position)  falls  below  the  level  s,  and  that  the  amount  of  the  order  be  the  difference  between 
S  and  the  present  inventory  position,  i.e.,  order  amounts  are  placed  “up  to  S.”  For  expositional  ease,  we 
describe  only  the  gradient  estimate  for  average  inventory  level  with  respect  to  the  policy  parameters  s  and 
q  =  S  —  s,  which  do  not  enter  through  probability  distributions  as  in  the  previous  examples. 

We  consider  the  model  where  all  excess  demand  is  backlogged  and  eventually  filled,  and  replenishment 
orders  are  immediately  received  (zero  lead  time),  so  that  inventory  level  and  inventory  position  coincide.  We 
assume  that  during  the  period,  demand  is  satisfied,  and  then  the  order  replenishment  decision  is  made  at 
the  end  of  the  period.  Let  Dn  be  the  demand  in  period  n,  which  is  assumed  i.i.d.  with  respective  density 
and  distribution  functions  given  by  /  and  F ,  and  let  Vn  be  the  inventory  level  in  period  n  after  demand 
satisfaction  at  the  end  of  the  period,  but  just  prior  to  the  order  replenishment  decision.  This  quantity 
satisfies  a  recursive  equation  somewhat  analogous  to  the  Lindley  equation: 


Vn+ 1 


Vn  F)n+ 1  if  Vn  ^  s, 
S  -  Dn+1  if  Vn  <  s. 


The  average  inventory  level  over  N  periods  is  given  by 


71=1 


(35) 


From  a  sample  path  point  of  view,  the  key  discrete  event  in  the  system  is  the  ordering  decision  each  period. 
A  change  in  s,  with  q  held  fixed,  has  no  effect  on  these  decisions,  so  infinitesimal  perturbations  in  s  result 
in  infinitesimal  changes  in  the  inventory  level  and  in  the  sample  performance  function  Vn.  In  particular, 
for  a  perturbation  of  size  As  (of  any  size,  not  necessarily  infinitesimal),  Vn(s  +  As)  =  Vn(s)  +  As,  and  thus 
dV n/9s  =  1  is  an  unbiased  estimator  for  dE[V n}/9s.  Intuitively,  the  shape  of  the  sample  path  is  unaltered 
by  changes  in  s  if  q  is  held  constant;  the  entire  sample  path  is  merely  shifted  by  the  size  of  the  change 
(assuming  starting  inventory  of  Vo  =  S  —  s  +  q;  else,  the  statement  only  holds  starting  from  the  first  order 
point).  The  IPA  estimator  can  also  be  obtained  by  simply  differentiating  the  recursive  relationship  (35), 
noting  that  Dn  does  not  depend  on  s  or  q: 

dVn+ 1 
d9 
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if  Vn  >  s, 
if  Vn  <  s, 


inventory  level 


Figure  2:  Quantities  used  in  (s,  S)  inventory  system. 


for  either  9=s  or  9=q.  Under  the  assumption  that  Vo=S=s  +  q ,  the  expression  reduces  to  1  for  all  n  when 
9  =  s. 

On  the  other  hand,  a  change  in  q  with  s  held  fixed  may  cause  a  change  in  the  set  of  ordering  decisions, 
resulting  in  a  drastic  change  in  the  sample  path  and  hence  in  the  performance  measure  V n-  Thus,  SPA  is 
required  to  derive  an  unbiased  gradient  estimator  for  9  =  q.  We  provide  a  brief  informal  derivation  for  a 
right-hand  SPA  estimator,  i.e. ,  A q  >  0,  in  which  a  period  where  a  replenishment  order  was  originally  placed 
could  become  one  where  an  order  is  not  placed,  for  sufficiently  large  A q  (refer  to  Figure  2).  To  calculate  the 
probability  rate  for  such  an  event  requires  knowing  the  quantity  above  the  reorder  point  s  prior  to  demand 
being  realized  in  the  period,  given  by  £„  =  Yn  —  s  in  Figure  2,  where  Yn  is  the  inventory  position/level  prior 
to  demand  being  realized.  Conditioning  on  all  demands  except  the  one  in  the  period  of  interest  (in  which 
an  order  is  placed),  Dn,  the  corresponding  probability  rate  is  then  given  by 

,.  P($n  +  Ag>  Dn\jn  <  Dn)  =  /(&) 

A™0  A q  1  -  F(£„)  ■ 


The  resulting  conditional  expected  effect  on  the  performance  measure  requires  some  further  analysis,  and  is 
omitted  here,  and  we  simply  give  the  final  SPA  estimator  for  dE[V n\/ 9q,  which  can  be  easily  and  efficiently 
estimated  from  the  original  sample  path: 


1 + *„<£<. 1  -  •"(y 


/&,) 


>  -  E[D]  -  V n]  , 


where  E[D]  is  mean  demand. 

The  LR/SF  and  WD  methods  require  a  change  of  variables  in  order  to  move  the  parameters  into  the 
distribution,  which  requires  some  non-trivial  analysis;  see  Pflug  and  Rubinstein  (2002)  for  details. 
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6  Basic  Theoretical  Tools 


The  key  result  used  in  the  theoretical  proofs  of  unbiasedness  is  the  (Lebesgue)  dominated  convergence 
theorem  that  allows  for  the  exchange  of  limit  and  expectation  operators  required  in  Equations  (12)  or  (13): 

Theorem  1  (Dominated  Convergence  Theorem)  If  lin^^oo  gn  =  g  a.s.  and  \gn\  <  M  Vn  a.s.  with 
E[M]  <  oo,  then  lim^o 0E[gn\  =  E[g\. 

Take  AO  — >  0  instead  of  n  — *  oo,  and  g  is  the  gradient  estimator,  so  g&8  is  the  limiting  sequence  that 
defines  the  sample  (path)  gradient.  Verifying  that  an  actual  bound  exists  is  often  a  non-trivial  task  in 
applications,  especially  in  the  case  of  perturbation  analysis. 

Looking  at  (12),  we  translate  these  conditions  to 


9A8 


Y{9  +  A9)-Y{9) 

A 9 


w  J(x-,e  +  A9)-f(x-,9) 

9A8  =  1  {X) - - , 

for  IPA  and  LR/SF,  respectively. 

For  IPA,  the  following  generalization  of  the  mean  value  theorem  is  most  useful: 

Theorem  2  (Generalized  Mean  Value  Theorem)  Let  Y  be  a  continuous  function  that  is  differentiable 
on  a  compact  set  0  =  0 \D,  where  I?  is  a  set  of  countably  many  points.  Then,  V0,  9  +  A 9  £  0, 


Y(9  +  AO)  -  Y(9) 

dY 

A  9 

—  suP0Ge 

d9 

If  Y(0)  can  be  shown  to  be  continuous  and  piecewise  differentiable  on  0  w.p.l,  then  it  just  remains  to 
show 


E 


suPeee 


dY 

~dO 


<  oo, 


to  satisfy  the  conditions  required  for  unbiasedness  via  the  dominated  convergence  theorem.  Basically,  in  order 
for  the  chain  rule  to  be  applicable,  the  sample  performance  function  needs  to  be  continuous  with  respect 
to  the  underlying  random  variable  (s).  This  translates  into  requirements  on  the  form  of  the  performance 
measure  and  on  the  dynamics  of  the  underlying  stochastic  system. 

For  the  LR/SF  method,  the  bound  is  applied  to  the  (joint)  density  (or  mass)  function.  Note  that  the 
bound  is  for  f(x\  9)  with  respect  to  the  parameter  9  and  not  its  usual  argument.  For  WD,  the  required 
interchange  is  guaranteed  by  the  definition  of  the  weak  derivative,  but  the  sample  performance  must  be  in 
the  set  of  “test”  functions  C  in  the  definition,  which  again  generally  requires  the  dominated  convergence 
theorem  (or  uniform  integrability,  which  is  usually  difficult  to  check  directly). 

The  previous  examples  can  be  used  to  show  in  very  simple  cases  where  difficulties  arise. 

Consider  the  p.d.f. 

f(x;  0)  =  ^1{0  <  x  <  6»}, 

where  the  LR/SF  method  does  not  apply.  In  this  case,  /  viewed  as  a  function  of  9  for  fixed  x  has  a 
discontinuity  at  9  =  x. 

Consider  the  function 

P(Y  >  y)  =  E[1{Y  >  y}]. 


in  which  IPA  will  not  work.  In  this  case,  the  performance  measure  is  an  indicator  function,  which  is 
discontinuous  in  its  argument. 

Thus,  in  both  simple  examples,  the  dominated  convergence  theorem  is  not  applicable  as  the  required 
quantity  cannot  be  bounded.  However,  it  is  only  a  sufficient  condition,  not  necessary,  so  in  some  (very) 
special  cases,  unbiasedness  may  hold  even  without  the  boundedness  (continuity). 
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7  Simple  Guidelines  for  the  Simulation  Practitioner 

We  summarize  the  most  important  considerations  in  applying  the  three  direct  gradient  procedures  (PA, 
LR/SF,  WD): 

•  IPA:  cannot  handle  “bad  performance  measures,  e.g.,  indicator  functions,  or  non-smooth  systems;  one 
way  of  checking  the  latter  is  the  commuting  condition,  which  checks  event  sequences  in  the  system 
(cf.  Glasserman  1991);  smoothness  may  depend  on  system  representation  (see  discussion  on  role  of 
representations) ; 

•  SPA:  choice  of  what  to  condition  on,  and  how  to  compute  (estimate)  resulting  conditional  expectation; 
may  require  many  additional  simulations; 

•  LR/SF  and  WD:  may  encounter  difficulties  in  handling  parameters  not  in  distribution  (so-called  “struc¬ 
tural”  parameters) ;  may  need  to  try  a  change  of  variables; 

•  LR/SF:  if  the  parameter  appears  in  an  input  distribution  that  is  reused  frequently  (parameter  makes 
too  many  repeat  appearances),  e.g.,  i.i.d.  service  times  in  a  queueing  system,  then  may  need  to  find  a 
way  to  truncate  the  estimator  to  mitigate  the  linear  increase  in  variance; 

•  WD:  choice  of  which  (non-unique)  WD  representation  to  use;  also  high-dimensional  vectors  may  require 
many  simulations; 

•  For  discrete  distributions,  IPA  works  if  the  parameter  occurs  in  the  support  values,  whereas  LR/SF 
and  WD  work  if  the  parameter  occurs  in  the  support  probabilities; 

•  Higher  derivative  estimates  are  usually  easiest  to  derive  using  LR/SF,  but  even  then  the  variance  of 
the  resulting  estimators  may  be  problematic. 

The  characteristics/choices  in  applying  SPA  are  strikingly  similar  to  the  use  of  conditional  Monte  Carlo 
for  variance  reduction.  The  choice  of  what  to  condition  on  in  applying  SPA  also  has  analogs  to  the  WD 
representation  choice.  The  simplest  procedures  to  use  are  the  LR/SF  and  IPA  estimators.  The  WD  estimator 
may  be  easier  to  apply  than  SPA,  because  one  has  a  set  of  “standard”  choices  for  a  large  class  of  distributions. 
However,  there  is  no  guarantee  that  these  choices  are  necessarily  good  for  a  particular  application. 

It  should  be  clear  from  this  brief  summary  that  the  direct  gradient  estimation  techniques  may  require 
some  effort  on  the  part  of  the  simulation  user,  whereas  the  indirect  techniques  are  straightforward  to  apply. 
To  put  it  another  way,  for  direct  gradient  estimation  sometimes  it  takes  some  art  to  do  the  science. 

8  Applications 

In  discussing  applications,  the  focus  is  on  the  direct  gradient  estimators,  since  application  of  the  indirect 
gradient  estimates  is  generally  more  straightforward  and  domain  independent.  The  most  dominant  applica¬ 
tion  of  direct  gradient  estimation  in  the  research  literature  is  clearly  queueing  systems;  however,  inventory 
management  and  financial  engineering  have  employed  many  of  these  results  in  real-world  practice. 

For  derivatives  with  respect  to  distributional  parameters,  IPA  can  be  applied  in  any  single-class  Jackson- 
like  network.  Here,  “Jackson-like”  (also  called  a  “generalized  Jackson  network”)  means  that  the  network 
retains  all  of  the  usual  characteristics  of  a  Jackson  network  except  that  service  times  and  interarrival  times 
(in  an  open  network)  do  not  have  to  be  exponentially  distributed.  This  is  also  true  for  the  LR/SF  method 
and  the  WD  method.  In  the  latter  case,  there  are  often  choices  of  the  WD  used.  However,  if  the  parameter 
is  the  routing  probabilities,  then  IPA  cannot  be  applied  directly.  Also,  if  the  network  is  extended  to  multiple 
classes  of  customers,  then  IPA  is  often  not  applicable. 

It  is  relatively  straightforward  to  apply  the  LR/SF  method  to  queueing  systems,  the  only  caveat  being 
the  potential  variance  problem  mentioned  in  the  previous  section.  It  is  also  relatively  straightforward  to 
derive  weak  derivative  estimators,  but  there  are  generally  many  choices  of  distributions,  and  often  one  or 
more  additional  simulations  using  different  input  distributions  will  be  necessary. 

Inventory  systems  is  a  domain  in  which  direct  gradient  estimation  has  been  successfully  applied  in  real- 
world  applications.  Because  the  parameters  of  interest  are  usually  those  controlling  replenishment  decisions 
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as  in  the  (s,  S)  inventory  control  example  of  Section  5,  PA  is  most  relevant.  For  so-called  base-stock  policies, 
often  IPA  suffices.  Such  IPA  estimators  are  being  used  to  optimize  inventory  management  in  the  worldwide 
supply  chain  of  Caterpillar,  and  the  success  of  the  approach  is  reported  in  a  Fortune  magazine  article,  “New 
Victories  in  the  Supply  Chain  Revolution”  (by  Philip  Siekman,  October  30,  2000);  technical  details  of  the 
approach  can  be  found  in  Kapuscinski  and  Tayur  (1999).  For  more  complicated  ordering  policies  such  as  an 
(s,  S)  policy  that  comes  about  with  the  inclusion  of  a  fixed  order  cost,  extensions  of  IPA  are  required;  see  Fu 
and  Healy  (1992,  1997),  Fu  (1994b),  Bashyam  and  Fu  (1994),  Fu  and  Hu  (1994),  Zhang  and  Fu  (2005),  Pflug 
and  Rubinstein  (2002)  for  an  LR/SF  estimator  using  the  push-out  method  with  conditioning,  and  especially 
Chapter  7  of  Fu  and  Hu  (1997). 

Because  on  Wall  Street  and  elsewhere  in  the  global  financial  world,  Monte  Carlo  simulation  is  routinely 
used  for  pricing  and  hedging  derivatives,  it  is  now  commonly  included  in  any  finance  text  that  addresses 
numerical  solution  methods.  Simulation  can  easily  handle  the  pricing  of  high-dimensional  derivatives,  such 
as  path-dependent  claims  or  systems  with  a  large  number  of  underlying  assets  and/or  uncertainties  (e.g., 
stochastic  volatility  and  interest  rate  models).  In  hedging,  gradients  are  the  most  critical  ingredients  in 
determining  what  positions  need  to  be  taken  in  any  portfolio.  In  equities,  they  are  usually  referred  to  as 
“Greeks,”  because  they  are  represented  by  Greek  letters.  For  example,  the  most  common  one  is  the  Delta, 
which  is  the  sensitivity  of  a  derivative  price  with  respect  to  the  underlying  security  price,  e.g.,  the  price  of 
a  call  option  with  respect  to  the  underlying  stock  price  on  which  the  contract  is  written.  Delta  hedging  and 
other  types  of  hedging  are  described  in  most  elementary  finance  textbooks  on  derivatives  such  as  Hull  (2000), 
and  the  text  by  Clewlow  and  Strickland  (1999)  includes  simple  IPA  estimators  for  calculating  Greeks  using 
Monte  Carlo  simulation  (though  they  do  not  use  the  term  IPA  nor  stochastic  gradient  estimation).  Fu  and 
Hu  (1995)  and  Broadie  and  Glasserman  (1996)  were  the  first  to  develop  stochastic  gradients  in  these  settings; 
see  also  Glasserman  (2004).  Heidergott  (2001b)  considered  weak  derivatives  in  a  similar  setting  as  Fu  and 
Hu  (1995).  In  fixed  income  securities,  the  analogous  quantities  go  by  the  name  of  duration  and  convexity. 
In  Chen  and  Fu  (2002),  IPA  is  applied  to  the  pricing  and  hedging  of  mortgage-backed  securities,  where  both 
first  and  second  derivatives  are  required.  For  just  five  parameters,  if  symmetric  differences  were  used,  this 
would  require  236  (1  +  10  +  225)  simulations  for  each  set  of  performance  measures  and  gradient  estimates. 
In  actual  implementation,  there  was  a  resulting  dramatic  97.2%  reduction  in  computation  time.  Another 
important  finance  application  is  the  pricing  of  American-style  derivatives:  contingent  claims  in  which  early 
exercise  is  permitted.  One  of  the  earliest  proposed  approaches  to  this  problem  was  to  parameterize  the  early 
exercise  boundary,  thus  casting  the  optimal  stopping  problem  as  a  stochastic  optimization  problem  with 
respect  to  the  parameters.  The  earliest  example  of  applying  PA  and  SA  to  such  an  option  pricing  problem 
is  given  in  Fu  and  Hu  (1995).  Earlier  editions  of  Hull  (2000)  and  other  finance  texts  claimed  that  simulation 
could  be  used  only  to  price  European  options,  so  this  is  one  of  many  approaches  that  dispelled  that  belief. 
See  Glasserman  (2004)  for  other  approaches  and  references  on  this  topic. 

Other  applications  include  stochastic  activity  networks  (see  Rubinstein  and  Shapiro  1993  for  the  LR/SF 
method,  Bowman  1994  for  IPA,  and  Fu  2005  for  SPA  and  WD);  preventive  maintenance  (Fu,  Hu,  and  Shi 
1993;  Heidergott  1999,  2001a;  L’Ecuyer,  Martin,  and  Vazquez- Abad  1999);  statistical  process  control  (Fu 
and  Hu  1999);  traffic  light  signal  control  (Howell  and  Fu  2003;  also  mentioned  in  Rubinstein  and  Shapiro 
1993,  p.3,  as  an  example). 

9  Probing  Further 

Other  approaches  not  treated  here  include  frequency  domain  experimentation  (Schruben  and  Cogliano  1981; 
Schruben  1986;  Jacobson,  Buss,  and  Schruben  1991;  Jacobson  1994);  and  Malliavin  calculus,  which  has  been 
used  primarily  in  financial  applications  (e.g.,  Fournie  et  al.  1999,  Benhamou  2002,  and  references  therein). 

For  gradient-based  simulation  optimization,  further  discussion  can  be  found  in  the  papers  of  Fu  (2002), 
Andradottir  (1998),  Fu  (1994a),  and  Jacobson  and  Schruben  (1989);  see  also  the  books  by  Rubinstein  and 
Shapiro  (1993),  Pflug  (1996),  Fu  and  Hu  (1997),  and  Spall  (2003),  as  well  as  Fu  (2001b).  Both  of  the  well- 
known  simulation  textbooks  by  Law  and  Kelton  (2000)  and  Banks  et  al.  (2000)  also  devote  sections  to  the 
topic,  but  the  latter  does  not  discuss  gradient  estimation,  per  se.  Applications  to  the  single-server  queue 
example  considered  here  include  the  theoretical  convergence  results  of  Fu  (1990),  Chong  and  Ramadge  (1992), 
and  L’Ecuyer  and  Glynn  (1994),  and  the  in-depth  numerical  comparisons  of  L’Ecuyer,  Girouz,  and  Glynn 
(1994)  and  Andradottir  (1998).  Andradottir  (1996)  considers  the  more  general  setting  of  using  the  LR/SF 
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estimators  in  SA  algorithms,  and  Tang,  L’Ecuyer,  and  Chen  (1999)  analyze  the  asymptotic  efficiency  of  an 
averaging  version  of  SA  using  perturbation  analysis  estimators.  Work  on  sample  path  optimization  (called  the 
stochastic  counterpart  method  in  Rubinstein  and  Shapiro  1993)  includes  Plambeck  et  al.  (1996),  Robinson 
(1996),  Giirkan,  Ozge,  and  Robinson  (1999),  Homem-de-Mello,  Shapiro,  and  Spearman  (1999);  Dussault  et 
al.  (1997)  combines  the  approach  with  stochastic  approximation.  See  also  the  chapter  by  Shapiro  (2003)  on 
using  Monte  Carlo  methods  for  stochastic  programming.  The  original  papers  on  stochastic  approximation  are 
Robbins  and  Monro  (1951)  and  Kiefer  and  Wolfowitz  (1952).  For  a  more  recent  general  book  on  stochastic 
approximation,  see  Kushner  and  Yin  (1997);  L’Ecuyer  and  Yin  (1998)  discusses  convergence  rates  as  a 
function  of  computational  budget.  Spall  (2000)  provides  both  indirect  and  direct  gradient-based  SA  methods 
for  obtaining  near-optimal  or  optimal  convergence  rates  via  stochastic  analogues  to  the  deterministic  Newton- 
Raphson  algorithm,  using  Hessian  matrix  estimation.  SPSA  was  introduced  by  Spall  (1992),  although  the 
random  directions  method  was  proposed  in  Kushner  and  Clark  (1978);  see  http://www.jhuapl.edu/SPSA/ 
for  an  extensive  annotated  bibliography.  Application  to  the  settings  of  this  handbook  is  described  in  Fu 
and  Hill  (1997),  and  using  deterministic  sequences  for  SPSA  is  considered  in  Bhatnagar  et  al.  (2003)  and 
Xiong,  Wang,  and  Fu  (2002).  Andradottir  (1998)  contains  further  useful  discussion  on  the  application  of 
SA  algorithms  in  simulation  optimization,  especially  regarding  the  issues  of  choosing  the  gain  sequence 
and  modifying  the  algorithm  in  cases  where  the  objective  function  is  not  well-behaved.  Most  stochastic 
approximation  algorithms  involve  known  constraints;  Bashyam  and  Fu  (1998)  consider  the  case  of  a  noisy 
constraint  that  must  also  be  estimated,  in  particular  a  service  level  constraint  for  an  (s,  S)  inventory  system. 
Studies  on  choosing  the  finite  difference  used  in  the  Kiefer- Wolfowitz  version  of  SA  include  Zazanis  and  Suri 
(1993),  and  L’Ecuyer  and  Perron  (1994). 

For  perturbation  analysis,  the  books  by  Glasserman  (1991),  Ho  and  Cao  (1991),  and  Cao  (1994)  treat 
IPA  extensively  (in  particular,  see  Glasserman  1991  for  a  complete  treatment  of  the  commuting  condition; 
see  also  Cao  1985  and  Heidelberger  et  al.  1988)  and  SPA  to  some  extent,  whereas  the  book  by  Fu  and  Hu 
(1997)  is  a  comprehensive  treatment  of  SPA,  which  was  introduced  by  Gong  and  Ho  (1987)  and  Suri  and 
Zazanis  (1988);  see  also  Fu  and  Hu  (1992),  Fu  (2001a),  and  Fu  and  Hu  (1991,  1993),  where  higher  derivatives 
for  multi-server  queues  are  treated.  Many  of  the  examples  here  are  adopted  from  Fu  and  Hu  (1997).  More 
recent  work  connecting  IPA  with  Markov  decision  processes  using  the  idea  of  potentials,  which  also  relates  to 
the  LR/SF  method,  include  Cao  (2000,  2003ab).  The  field  of  perturbation  analysis  for  gradient  estimation 
includes  numerous  other  extensions  and  variations  on  IPA  not  discussed  here,  including  rare  perturbation 
analysis  (Bremaucl  and  Vazquez-Abad  1992);  structural  IPA  (Dai  and  Ho  1995);  discontinuous  perturbation 
analysis  (Shi  1996);  and  augmented  IPA  (Gaivoronski,  Shi,  and  Sreenivas  1992).  Seminal  papers  applying 
perturbation  analysis  for  estimating  the  effects  of  finite  perturbations  in  the  parameter  include  Ho  and  Li 
(1988),  Cassandras  and  Strickland  (1989),  and  Vakili  (1991). 

For  the  likelihood  ratio  or  score  function  method,  a  good  reference  is  the  book  by  Rubinstein  and  Shapiro 
(1993),  which  also  includes  some  discussion  of  IPA  (Chapter  5  on  the  “push  in”  method),  as  well  as  both 
first  and  second  derivative  estimators  for  most  of  the  entries  in  Table  1  (Section  2.2);  see  also  Reiman 
and  Weiss  (1989),  Glynn  (1990),  and  Andradottir  (1996).  The  “push  out”  method  for  handling  structural 
parameters  was  introduced  in  Rubinstein  (1992);  see  also  Section  2.5.4  in  Rubinstein  and  Shapiro  (1993). 
Using  conditional  expectation  to  reduce  variance  in  the  LR/SF  method  was  considered  in  McLeish  and 
Rollins  (1992);  see  also  Section  3.4  in  Rubinstein  and  Shapiro  (1993).  A  “unified”  view  of  IPA  and  LR/SF 
is  presented  in  L’Ecuyer  (1990),  which  allows  the  parameter  to  appear  in  both  the  sample  performance  and 
the  distribution  (see  also  L’Ecuyer  1995  for  further  discussion  and  some  technical  corrections). 

The  weak  derivative  method  was  introduced  by  Pflug  (1989,  1990,  1996).  More  recent  work  attempting 
to  unify  the  approach  with  others  such  as  rare  perturbation  analysis  and  smooth  perturbation  analysis, 
includes  Heidergott  and  Vazquez-Abad  (2000,  2001),  in  the  setting  of  general  Markov  processes,  which  may 
not  provide  as  convenient  a  framework  for  the  discrete-event  simulation  setting  as  generalized  semi-Markov 
processes  (GSMPs).  Derivations  for  many  of  the  entries  in  Table  1  can  be  found  in  Heidergott,  Pflug,  and 
Vazquez-Abad  (2003). 

10  Future  Research  Directions 

Stochastic  gradient  estimation  research  has  matured  over  the  last  decade  to  the  point  that  much  of  the 
analysis  has  become  theoretical  rather  than  algorithmic.  This  line  of  research  addresses  many  issues  that 
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also  arise  in  traditional  steady-state  output  analysis  -  only  now  for  stochastic  gradient  estimators  -  and  can 
involve  advanced  probabilistic  tools.  In  contrast,  two  possible  directions  for  future  research  described  briefly 
here  are  more  motivated  by  simulation  practice  and  have  much  algorithmic  room  left  in  them  to  grow. 

How  is  gradient  estimation  best  employed  in  simulation  optimization?  One  recent  development  is  the 
use  of  fluid  models  in  this  setting.  A  discrete-event  simulation  model  for  which  application  of  the  direct 
derivative  estimation  may  be  difficult  is  approximated  by  a  stochastic  fluid  model,  which  is  used  to  derive 
IPA  estimators  that  are  then  implemented  on  either  the  (simulated)  stochastic  fluid  model  or  sometimes 
on  the  original  stochastic  discrete-event  simulation  model.  In  the  former  case,  where  the  IPA  estimates  are 
generally  unbiased,  the  optimal  settings  often  provide  a  good  approximation  for  the  optimum  in  the  original 
model.  In  the  latter  case,  the  new  IPA  estimators  do  not  yield  unbiased  estimators  when  implemented  in 
the  discrete-event  model  (so  are  generally  not  useful  in  sensitivity  analysis,  per  se) ,  but  they  often  provide  a 
good  approximation  of  the  zero  gradient  location  when  used  in  optimization;  however,  the  theoretical  basis 
for  this  good  fortune  is  still  not  fully  understood.  Some  papers  in  this  area  include  Wardi  et  al.  (2002), 
Cassandras  et  al.  (2002),  Sun  et  al.  (2004),  and  Panayiotou,  Howell,  and  Fu  (2005). 

The  LR/SF  method  is  closely  related  to  importance  sampling.  Investigating  this  connection  more  thor¬ 
oughly  for  variance  reduction  purposes,  and  doing  so  in  a  more  general  stochastic  gradient  setting,  would  be 
of  benefit  to  simulationists.  Some  work  along  this  line  in  the  setting  of  stochastic  programming  is  contained 
in  Shapiro  and  Homem-de-Mello  (1998). 
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